
The Eighth International Conference on Computing and Information Technology

IC2IT 2012

Table of Contents

Message from KMUTNB President ......... ii
Message from General Chair ......... iii
Conference Organizers ......... iv
Conference Organization Committee ......... v
Technical Program Committee ......... vi
Keynote Speaker ......... vii
Technical Program Contents ......... x
Invited Papers ......... 1
Regular Papers ......... 8
Author Index ......... 186

The Eighth International Conference on Computing and Information Technology

IC2IT 2012

Message from KMUTNB President

Nowadays, it is generally accepted that the development of a nation stems from technical advancement, and this has become the key factor that dictates the development of any country. Many issues can affect development, such as international economics, highly competitive markets, social and cultural differences, and global environmental problems. Thus, strengthening a country's capability to gain knowledge, both physically and mentally, provides the ability to deal with critical issues, not only in urban society but also in the surrounding countryside. To ensure the overall stability of a country in the long term, technology is considered one of the most important mechanisms to support the management of educational quality and to provide the ability to continually improve it. This can be seen from many civilized countries that have invested resources and fundamental infrastructure in the deployment of Information Technology in their education systems. The development of innovative teaching and learning that focuses on bringing Information Technology to the forefront of education benefits the population as a whole. The main goal is to improve the quality of life and to progress evenly and equally towards a better future. I would like to say a special thank you to everybody involved in this conference, from partners to stakeholders; without you this would not have been possible. I hope this conference provides a good opportunity for all your voices to be heard.

(Professor Dr. Teravuti Boonyasopon) President, King Mongkut's University of Technology North Bangkok


Message from General Chair

Some of the most dramatic changes in the world are caused by the development of technology, which continually changes our daily lives. Education, research, and development are necessary to understand the modern world we are now a part of, in particular the computer and information technology field. The Faculty of Information Technology, KMUTNB, will hold the 8th Conference on Computing and Information Technology on the 9th-10th of May 2012 at the Dusit Thani Hotel, Pattaya City, with the goal of serving as a platform to publish the findings of academic research in the field of computers and information technology from students, professors, researchers, and the general public. The conference is held in cooperation with local and international institutions, including Fern University in Hagen (Germany), Oklahoma State University (USA), Chemnitz University of Technology (Germany), Edith Cowan University (Australia), National Taiwan University (Taiwan), Hanoi National University of Education (Vietnam), Nakhon Pathom Rajabhat University, Kanchanaburi Rajabhat University, Siam University, and Ubon Ratchathani University. Thank you to the President and CEO of King Mongkut's University of Technology North Bangkok, and to all involved organizations and committees who support and drive this conference to be successful.

(Associate Professor Dr.Monchai Tiantong) General Chair


Conference Organizers

King Mongkut's University of Technology North Bangkok, Thailand
Fern University in Hagen, Germany
Oklahoma State University, USA
Chemnitz University of Technology, Germany
Edith Cowan University, Australia
National Taiwan University, Taiwan
Hanoi National University of Education, Vietnam
Mahasarakham University, Thailand
Kanchanaburi Rajabhat University, Thailand
Siam University, Thailand
Nakhon Pathom Rajabhat University, Thailand
Ubon Ratchathani University, Thailand


Conference Organization Committee

General Chair: Assoc. Prof. Dr. Monchai Tiantong, King Mongkut's University of Technology North Bangkok
Technical Program Chair: Prof. Dr. Herwig Unger, Fern University in Hagen, Germany
Conference Treasurer: Assist. Prof. Dr. Supot Nitsuwat, King Mongkut's University of Technology North Bangkok
Secretary and Publication Chair: Assist. Prof. Dr. Phayung Meesad, King Mongkut's University of Technology North Bangkok


Technical Program Committee

Alain Bui, Uni Paris 8, France Alisa Kongthon, NECTEC, Thailand Anirach Mingkhwan, KMUTNB, Thailand Apiruck Preechayasomboon, TOT, Thailand Armin Mikler, University of North Texas, USA Atchara Masaweerawat, UBU, Thailand Banatus Soiraya, Thailand Bogdan Lent, Lent AG, Switzerland Chatchawin Namman, UBU, Thailand Chayakorn Netramai, KMUTNB, Thailand Cholatip Yawut, KMUTNB, Thailand Choochart Haruechaiyasa, NECTEC, Thailand Claudio Ramirez, USL, Mexico Craig Valli, ECU, Australia Dietmar Tutsch, Wuppertal, Germany Doy Sundarasaradula, TOT, Thailand Dursun Delen, OSU, USA Gerald Eichler, Telecom, Germany Gerald Quirchmayr, UNIVIE, Austria Hsin-mu Tsai, NTU, Taiwan Ho Cam Ha, HNUE, Vietnam Jamornkul Laokietkul,CRU, Thailand Janusz Kacprzyk, Polish Academy of Science, Poland Jie Lu, Univ. of Technology, Sydney, Australia Kairung Hengpraphrom, NPRU, Thailand Kamol Limtunyakul, KMUTNB, Thailand Kriengsak Treeprapin, UBU, Thailand Kunpong Voraratpunya, KMITL, Thailand Kyandoghere Kyamakya, Klagenfurt, Austria Maleerat Sodanil, KMUTNB, Thailand Marco Aiello, Groningen, The Netherlands Mark Weiser, OSU, USA Martin Hagan, OSU, USA Mirko Caspar, Chemnitz, Germany Nadh Ditcharoen, UBU, Thailand Nawaporn Visitpongpun, KMUTNB, Thailand Nattavee Utakrit, KMUTNB, Thailand Nguyen The Loc, HNUE, Vietnam Nalinpat Porrawatpreyakorn, UNIVIE, Austria Padej Phomsakha Na Sakonnakorn, Thailand Parinya Sanguansat, PIM, Thailand Passakon Prathombutr, NECTEC, Thailand Peter Kropf, Neuchatel, Switzerland Phayung Meesad, KMUTNB, Thailand

Prasong Praneetpolgrang, SPU, Thailand Roman Gumzej, University of Maribor, Slovenia Saowaphak Sasanus, TOT, Thailand Sirapat Boonkrong, KMUTNB, Thailand Somchai Prakarncharoen, KMUTNB, Thailand Soradech Krootjohn, KMUTNB, Thailand Suksaeng Kukanok, Thailand Surapan Yimman, KMUTNB, Thailand Sumitra Nuanmeesri, SSRU, Thailand Sunantha Sodsee, KMUTNB, Thailand Supot Nitsuwat, KMUTNB, Thailand Taweesak Ganjanasuwan, Thailand Tong Srikhacha, TOT, Thailand Tossaporn Joochim, UBU, Thailand Thibault Bernard, Uni Reims, France Thippaya Chintakovid, KMUTNB, Thailand Tobias Eggendorfer, Hamburg, Germany Thomas Bhme, TU Ilmenau, Germany Thomas Tilli, Telecom, Germany Dang Hung Tran, HNUE, Vietnam Ulrike Lechner, UniBw, Germany Uraiwan Inyaem, RMUTT, Thailand Wallace Tang, CityU Hongkong Wolfram Hardt, Chemnitz, Germany Winai Bodhisuwan, KU, Thailand Wongot Sriurai, UBU, Thailand Woraniti Limpakorn, TOT, Thailand


Keynote Speaker

Professor Dr. Martin Hagan School of Electrical and Computer Engineering Oklahoma State University, USA

Topic : Dynamic Neural Networks : What Are They, and How Can We Use Them? Abstract : Neural networks can be classified into static and dynamic categories. In static networks, which are more commonly used, the output of the network is computed uniquely from the current inputs to the network. In dynamic networks, the output is also a function of past inputs, outputs or states of the network. This talk will address the theory and applications of this interesting class of neural network. Dynamic networks have memory, and therefore they can be trained to learn sequential or time-varying patterns. This has applications in such disparate areas as control systems, prediction in financial markets, channel equalization in communication systems, phase detection in power systems, sorting, fault detection, speech recognition, and even the prediction of protein structure in genetics. These dynamic networks are generally trained using gradient-based (steepest descent, conjugate gradient, etc.) or Jacobian-based (Gauss-Newton, LevenbergMarquardt, Extended Kalman filter, etc.) optimization algorithms. The methods for computing the gradients and Jacobians fall generally into two categories: real time recurrent learning (RTRL) or backpropagation through time (BPTT). In this talk we will present a unified view of the training of dynamic networks. We will begin with a very general framework for representing dynamic networks and will demonstrate how BPTT and RTRL algorithms can be efficiently developed using this framework. While dynamic networks are more powerful than static networks, it has been known for some time that they are more difficult to train. In this talk, we will also investigate the error surfaces for these dynamic networks, which will provide interesting insights into the difficulty of dynamic network training.


Keynote Speaker

Prof. Dr. rer. nat. Ulrike Lechner
Institut für Angewandte Informatik, Fakultät für Informatik
Universität der Bundeswehr München, Germany

Topic: Innovation Management and the IT-Industry - Enabler of innovations or truly innovative? Abstract: Who wants to be innovative? Who needs to be innovative? Everybody? - Innovation seems to be paramount in today's economy and IT is an important driver for innovation. Think of E-Business and all the consumer electronics. Can it be safely assumed that this industry is innovative and masters the art and science of innovation management? What about important business model trends of today, outsourcing and cloud computing, or the bread-and-butter business of the many IT-consulting companies? How important is innovation to them and how do they master innovations and do innovation management? Empirical data is rather inconclusive about business model innovations of the IT-industry and the need to be innovative. The talk reports on experiences in innovation management in the IT-industry and discusses awareness of innovation, innovation in services vs. product innovations, and the various options to design innovation management. It provides an overview of the innovation landscape and ecosystems in the IT-industry as well as the theoretical background to analyze innovation networks. The talk discusses scientific approaches and open questions in this field.


Keynote Speaker

Dr. Hsin-Mu Tsai Department of Computer Science and Information Engineering National Taiwan University, Taiwan

Topic: Extend the Safety Shield - Building the Next Generation Vehicle Safety System. Abstract: For the past decade, various safety systems have been introduced by car manufacturers to drastically reduce the number of accidents. However, the conventional approach has limited the classes of risks which can be detected and handled by the safety systems to those which have a line-of-sight path to the sensors installed on vehicles. In this talk, I will propose the next-generation vehicle safety system, which utilizes two fundamental technologies. The system gives out warnings for the vehicle or the driver to react to potential risks in a timely manner and extends the classes of risks which can be detected by vehicles from only risks which have already appeared to also risks which have not yet appeared. Conceptually, this increases the size of the safety shield of the vehicle, since most accidents caused by detectable risks could be avoided. I will also present the related research challenges in implementing such a system and some preliminary results from the measurements we carried out at National Taiwan University.


Technical Program Contents


Wednesday May 9, 2012

8:00-9:00     Registration
9:00-9:30     Opening Ceremony by Prof. Dr. Teravuti Boonyasopon, President of King Mongkut's University of Technology North Bangkok
9:30-10:30    Invited Keynote Speech by Prof. Dr. Martin Hagan, Oklahoma State University, USA
              Topic: Dynamic Neural Networks: What Are They, and How Can We Use Them?
10:30-11:00   Coffee Break
11:00-12:00   Invited Keynote Speech by Prof. Dr. rer. nat. Ulrike Lechner, Universität der Bundeswehr München, Germany
              Topic: Innovation Management and the IT-Industry - Enabler of innovations or truly innovative?
12:00-13:00   Lunch
13:00-18:00   Parallel Session Presentation
18:00-22:00   Welcome Dinner

IC2IT 2012 Session I: Network & Security and Fuzzy Logic
Session Chair: Dr. Nawaporn Wisitpongphan

13:00-13:20   IC2IT2012-71    Improving VPN Security Performance Based on One-Time Password Technique Using Quantum Keys
                              Montida Pattaranantakul, Paramin Sangwongngam, and Keattisak Sripimanwat
13:20-13:40   IC2IT2012-33    Experimental Results on the Reloading Wave Mechanism for Randomized Token Circulation (p. 14)
                              Boukary Ouedraogo, Thibault Bernard, and Alain Bui
13:40-14:00   IC2IT2012-107   Statistical-Based Car Following Model for Realistic Simulation of Wireless Vehicular Networks (p. 19)
                              Kitipong Tansriwong and Phongsak Keeratiwintakorn
14:00-14:20   IC2IT2012-34    Rainfall Prediction in the Northeast Region of Thailand Using Cooperative Neuro-Fuzzy Technique (p. 24)
                              Jesada Kajornrit, Kok Wai Wong, and Chun Che Fung
14:20-14:40   IC2IT2012-46    Interval-Valued Intuitionistic Fuzzy ELECTRE Method (p. 30)
                              Ming-Che Wu and Ting-Yu Chen
14:40-15:00   Coffee Break


IC2IT 2012 Session II: Fuzzy Logic, Neural Network, and Recommendation Systems
Session Chair: Dr. Maleerat Sodanil

15:00-15:20   IC2IT2012-81    Optimizing of Interval Type-2 Fuzzy Logic Systems Using Hybrid Heuristic Algorithm Evaluated by Classification (p. 36)
                              Adisak Sangsongfa and Phayung Meesad
15:20-15:40   IC2IT2012-60    Neural Network Modeling for an Intelligent Recommendation System Supporting SRM for Universities in Thailand (p. 42)
                              Kanokwan Kongsakun, Jesada Kajornrit, and Chun Che Fung
15:40-16:00   IC2IT2012-44    Recommendation and Application of Fault Tolerance Patterns to Services (p. 48)
                              Tunyathorn Leelawatcharamas and Twittie Senivongse
16:00-16:20   IC2IT2012-43    Development of Experience Base Ontology to Increase Competency of Semi-Automated ICD-10-TM Coding System (p. 54)
                              Wansa Paoin and Supot Nitsuwat
16:20-16:30   Break

IC2IT 2012 Session III: Natural Language Processing and Machine Translation
Session Chair: Dr. Maleerat Sodanil

16:30-16:50   IC2IT2012-110   Collocation-Based Term Prediction for Academic Writing (p. 58)
                              Narisara Nakmaetee, Maleerat Sodanil, and Choochart Haruechaiyasak
16:50-17:10   IC2IT2012-65    Thai Poetry in Machine Translation (p. 64)
                              Sajjaporn Waijanya and Anirach Mingkhwan
17:10-17:30   IC2IT2012-45    Keyword Recommendation for Academic Publication Using Flexible N-gram (p. 70)
                              Rugpong Grachangpun, Maleerat Sodanil, and Choochart Haruechaiyasak
17:30-17:50   IC2IT2012-70    Using Example-Based Machine Translation for English-Vietnamese Translation (p. 75)
                              Minh Quang Nguyen, Dang Hung Tran, and Thi Anh Le Pham
18:00-22:00   Welcome Dinner


Thursday May 10, 2012

8:00-9:00     Registration
9:00-10:00    Invited Keynote Speech by Dr. Hsin-Mu Tsai, National Taiwan University, Taiwan
              Topic: Extend the Safety Shield - Building the Next Generation Vehicle Safety System
10:00-10:20   Coffee Break
10:20-12:00   Parallel Session Presentation
12:00-13:00   Lunch
13:00-18:00   Parallel Session Presentation

IC2IT 2012 Session IV: Image Processing, Web Mining, Clustering, and e-Business
Session Chair: Prof. Dr. Herwig Unger

10:20-10:40   IC2IT2012-57    Cross-Ratio Analysis for Building up the Robustness of Document Image Watermark (p. 81)
                              Wiyada Yawai and Nualsawat Hiransakolwong
10:40-11:00   IC2IT2012-73    PCA Based Handwritten Character Recognition System Using Support Vector Machine & Neural Network (p. 87)
                              Ravi Sheth and Kinjal Mehta
11:00-11:20   IC2IT2012-68    Web Mining Using Concept-Based Pattern Taxonomy Model (p. 92)
                              Sheng-Tang Wu, Yuefeng Li, and Yung-Chang Lin
11:20-11:40   IC2IT2012-59    A New Approach to Cluster Visualization Methods Based on Self-Organizing Maps (p. 98)
                              Marcin Zimniak, Johannes Fliege, and Wolfgang Benn
11:40-12:00   IC2IT2012-74    Detecting Source Topics Using Extended HITS (p. 104)
                              Mario Kubek and Herwig Unger
12:00-13:00   Lunch


IC2IT 2012 Session V: Evolutionary Algorithm, Heuristic Search, and Graphics Processing & Representation
Session Chair: Dr. Sunantha Sodsee

13:00-13:20   IC2IT2012-91    Blended Value Based e-Business Modeling Approach: A Sustainable Approach Using QFD (p. 109)
                              Mohammed Dewan and Mohammed Quaddus
13:20-13:40   IC2IT2012-94    Protein Structure Prediction in 2D Triangular Lattice Model Using Differential Evolution Algorithm (p. 116)
                              Aditya Narayan Hati, Nanda Dulal Jana, Sayantan Mandal, and Jaya Sil
13:40-14:00   IC2IT2012-48    Elimination of Materializations from Left/Right Deep Data Integration Plans (p. 121)
                              Janusz Getta
14:00-14:20   IC2IT2012-24    A Variable Neighbourhood Search Heuristic for the Design of Codes (p. 127)
                              Roberto Montemanni, Matteo Salani, Derek H. Smith, and Francis Hunt
14:20-14:40   IC2IT2012-63    Spatial Join with R-Tree on Graphics Processing Units (p. 133)
                              Tongjai Yampaka and Prabhas Chongstitwattana
14:40-15:00   Coffee Break

IC2IT 2012 Session VI: Web Services, Ontology, and Agents
Session Chair: Dr. Sucha Smanchat

15:00-15:20   IC2IT2012-41    Ontology Driven Conceptual Graph Representation of Natural Language (p. 138)
                              Supriyo Ghosh, Prajna Devi Upadhyay, and Animesh Dutta
15:20-15:40   IC2IT2012-88    Web Services Privacy Measurement Based on Privacy Policy and Sensitivity Level of Personal Information (p. 145)
                              Punyaphat Chaiwongsa and Twittie Senivongse
15:40-16:00   IC2IT2012-64    Measuring Granularity of Web Services with Semantic Annotation (p. 151)
                              Nuttida Muchalintamolee and Twittie Senivongse
16:00-16:20   IC2IT2012-83    Decomposing Ontology in Description Logics by Graph Partitioning (p. 157)
                              Pham Thi Anh Le, Le Thanh Nhan, and Nguyen Minh Quang
16:20-16:30   Break


16:30-16:50   IC2IT2012-49    An Ontological Analysis of Common Research Interest for Researchers (p. 163)
                              Nawarat Kamsiang and Twittie Senivongse
16:50-17:10   IC2IT2012-36    Automated Software Development Methodology: An Agent Oriented Approach (p. 169)
                              Prajna Devi Upadhyay, Sudipta Acharya, and Animesh Dutta
17:10-17:30   IC2IT2012-53    Agent Based Computing Environment for Accessing Privileged Services (p. 176)
                              Navin Agarwal and Animesh Dutta
17:30-17:50   IC2IT2012-52    An Interactive Multi-touch Teaching Innovation for Preschool Mathematical Skills (p. 181)
                              Suparawadee Trongtortam, Peraphon Sophatsathit, and Achara Chandrachai


Dynamic Neural Networks: What Are They, and How Can We Use Them?
Martin Hagan
School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, Oklahoma, 74078 mhagan@ieee.org

AbstractNeural networks can be classified into static and dynamic categories. In static networks, which are more commonly used, the output of the network is computed uniquely from the current inputs to the network. In dynamic networks, the output is also a function of past inputs, outputs or states of the network. This paper will address the theory and applications of this interesting class of neural network. Dynamic networks have memory, and therefore they can be trained to learn sequential or time-varying patterns. This has applications in such disparate areas as control systems, prediction in financial markets, channel equalization in communication systems, phase detection in power systems, sorting, fault detection, speech recognition, and even the prediction of protein structure in genetics. While dynamic networks are more powerful than static networks, it has been known for some time that they are more difficult to train. In this paper, we will also investigate the error surfaces for these dynamic networks, which will provide interesting insights into the difficulty of dynamic network training.

I. INTRODUCTION

Dynamic networks are networks that contain delays (or integrators, for continuous-time networks). These dynamic networks can have purely feedforward connections, or they can also have some feedback (recurrent) connections. Dynamic networks have memory. Their response at any given time will depend not only on the current input, but also on the history of the input sequence. Because dynamic networks have memory, they can be trained to learn sequential or time-varying patterns. This has applications in such diverse areas as control of dynamic systems [1], prediction in financial markets [2], channel equalization in communication systems [3], phase detection in power systems [4], sorting [5], fault detection [6], speech recognition [7], learning of grammars in natural languages [8], and even the prediction of protein structure in genetics [9]. Dynamic networks can be trained using standard gradient-based or Jacobian-based optimization methods. However, the gradients and Jacobians that are required for these methods cannot be computed using the standard backpropagation algorithm. In this paper we will discuss a general dynamic network framework, in which dynamic backpropagation algorithms can be efficiently developed.

There are two general approaches (with many variations) to gradient and Jacobian calculations in dynamic networks: backpropagation-through-time (BPTT) [10] and real-time recurrent learning (RTRL) [11]. In the BPTT algorithm, the network response is computed for all time points, and then the gradient is computed by starting at the last time point and working backwards in time. This algorithm is computationally efficient for the gradient calculation, but it is difficult to implement on-line, because the algorithm works backward in time from the last time step. In the RTRL algorithm, the gradient can be computed at the same time as the network response, since it is computed by starting at the first time point, and then working forward through time. RTRL requires more calculations than BPTT for calculating the gradient, but RTRL allows a convenient framework for on-line implementation. For Jacobian calculations, the RTRL algorithm is generally more efficient than the BPTT algorithm [12,13]. In order to more easily present general BPTT [10, 15] and RTRL [11, 14] algorithms, it will be helpful to introduce modified notation for networks that can have recurrent connections. In Section II, we will introduce this notation and will develop a general dynamic network framework. As a general rule, there have been two major approaches to using dynamic training. The first approach has been to use the general RTRL or BPTT concepts to derive algorithms for particular network architectures. The second approach has been to put a given network architecture into a particular canonical form (e.g., [16-18]), and then to use the dynamic training algorithm which has been previously designed for the canonical form. Our approach is to develop a very general framework in which to conveniently represent a large class of dynamic networks, and then to derive the RTRL and BPTT algorithms for the general framework. In Section III, we will demonstrate how this general dynamic framework can be applied to solve many real-world problems. Section IV will present procedures for computing gradients for the general framework. In this way, one computer code can be used to train arbitrarily constructed network architectures, without requiring that each architecture be first converted to a particular canonical form. Finally, Section V describes some complexities in the error surfaces of dynamic


networks, and shows how we can mitigate these complexities to achieve successful training for dynamic networks. II. A GENERAL CLASS OF DYNAMIC NETWORK Our general dynamic network framework is called the Layered Digital Dynamic Network (LDDN) [12]. The fundamental building block for the LDDN is the layer. A layer contains the following components: a set of weight matrices (input weights from external inputs, and layer weights from the outputs of other layers), tapped delay lines that appear at the input of a weight matrix, bias vector, summing junction, transfer function.
Figure 1. Example Layer

A prototype layer is shown in Fig. 1. The equations that define a layer response are

$$n^m(t) = \sum_{l \in I_m} \sum_{d \in DI_{m,l}} IW^{m,l}(d)\, p^l(t-d) + \sum_{l \in L^f_m} \sum_{d \in DL_{m,l}} LW^{m,l}(d)\, a^l(t-d) + b^m \qquad (1)$$

$$a^m(t) = f^m\big(n^m(t)\big) \qquad (2)$$

where $I_m$ is the set of indices of all inputs that connect to layer m, $L^f_m$ is the set of indices of all layers that connect forward to layer m, $p^l(t)$ is the lth input to the network, $IW^{m,l}$ is the input weight between input l and layer m, $LW^{m,l}$ is the layer weight between layer l and layer m, $b^m$ is the bias vector for layer m, $DL_{m,l}$ is the set of all delays in the tapped delay line between Layer l and Layer m, and $DI_{m,l}$ is the set of all delays in the tapped delay line between Input l and Layer m. For the LDDN class of networks, we can have multiple weight matrices associated with each layer - some coming from external inputs, and others coming from other layers. An example of a dynamic network in the LDDN framework is shown in Fig. 2.

Figure 2. Example Dynamic Network in the LDDN Framework

The LDDN framework is quite general. It is equivalent to the class of general ordered networks discussed in [10] and [19]. It is also equivalent to the signal flow graph class of networks used in [15] and [20]. However, we can increase the generality of the LDDN further. In LDDNs, the weight matrix multiplies the corresponding vector coming into the layer (from an external input in the case of IW, and from another layer in the case of LW). This means that a dot product is formed between each row of the weight matrix and the input vector.
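To make Eqs. (1) and (2) concrete, the sketch below computes one LDDN layer response at a single time step. It is a minimal illustration, not the authors' code: the dictionary-based storage of weights, delay sets, and signal histories, as well as the example dimensions and the tanh transfer function in the usage lines, are all assumptions made for this example.

```python
import numpy as np

def layer_response(m, t, IW, LW, b, f, DI, DL, p_hist, a_hist):
    """Compute a^m(t) for one LDDN layer, following Eqs. (1) and (2).

    IW[(m, l)][d], LW[(m, l)][d] : input / layer weight matrices for delay d
    b[m]                         : bias vector of layer m
    f[m]                         : transfer function of layer m
    DI[(m, l)], DL[(m, l)]       : delays in the corresponding tapped delay lines
    p_hist[l], a_hist[l]         : input and layer-output histories, indexed by time
    """
    n = b[m].copy()
    # input-weight terms: sum over connected inputs l and their delays d
    for (layer, l), delays in DI.items():
        if layer == m:
            for d in delays:
                n = n + IW[(m, l)][d] @ p_hist[l][t - d]
    # layer-weight terms: sum over connected layers l and their delays d
    for (layer, l), delays in DL.items():
        if layer == m:
            for d in delays:
                n = n + LW[(m, l)][d] @ a_hist[l][t - d]
    return f[m](n)   # Eq. (2): a^m(t) = f^m(n^m(t))

# minimal usage: one layer (m=1), one external input, one self-recurrent delay
S, R = 3, 2
IW = {(1, 1): {0: 0.1 * np.ones((S, R))}}
LW = {(1, 1): {1: 0.5 * np.eye(S)}}
b  = {1: np.zeros(S)}
f  = {1: np.tanh}
DI = {(1, 1): [0]}
DL = {(1, 1): [1]}
p_hist = {1: [np.ones(R), np.ones(R)]}     # p(0), p(1)
a_hist = {1: [np.zeros(S)]}                # a(0) initial condition
a_hist[1].append(layer_response(1, 1, IW, LW, b, f, DI, DL, p_hist, a_hist))
```

In the GLDDN generalization discussed next, the matrix products and the running sum inside this function become arbitrary differentiable weight and net input functions.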

We can consider more general weight functions than simply the dot product. For example, radial basis layers compute the distances between the input vector and the rows of the weight matrix. We can allow weight functions with arbitrary (but differentiable) operations between the weight matrix and the input vector. This enables us to include higher-order networks as part of our framework. Another generality we can introduce is for the net input function. This is the function that combines the results of the weight function operations with the bias vector. In LDDNs, the net input function has been a simple summation. We can allow arbitrary, differentiable net input functions to be used. The resulting network framework is the Generalized LDDN (GLDDN). A block diagram for a simple GLDDN (without delays) is shown in Fig. 3. The equations of operation for a GLDDN are Weight Functions:

$$iz^{m,l}(t,d) = ih^{m,l}\big(IW^{m,l}(d),\, p^l(t-d)\big) \qquad (3)$$

$$lz^{m,l}(t,d) = lh^{m,l}\big(LW^{m,l}(d),\, a^l(t-d)\big) \qquad (4)$$

Net Input Functions:

$$n^m(t) = o^m\Big(\big\{\, iz^{m,l}(t,d) : l \in I_m,\ d \in DI_{m,l} \,\big\},\ \big\{\, lz^{m,l}(t,d) : l \in L^f_m,\ d \in DL_{m,l} \,\big\},\ b^m\Big) \qquad (5)$$

Transfer Functions:

$$a^m(t) = f^m\big(n^m(t)\big) \qquad (6)$$

Figure 3. Example Network with General Weight and Net Input Functions

III. APPLICATIONS OF DYNAMIC NETWORKS

Dynamic networks have been applied to a wide variety of application areas. In this section, we would like to give just a brief overview of some of these.

A. Phase Detection in Power Systems

Voltage phase and local frequency deviation are used in disturbance monitoring and control for power systems. Modern power electronic devices introduce complex interharmonics, which make it difficult to extract the phase. The dynamic neural network shown in Fig. 4 has been used [4] to detect phase in power systems. The input to the network is the line voltage:

$$p(t) = A(t)\,\sin\big(2\pi f_c t + \phi(t)\big) + v(t).$$

The target output is the phase $\phi(t)$. The equations of operation for the network are

$$n^1(t) = \sum_{d \in DI_{1,1}} IW^{1,1}(d)\, p^1(t-d) + LW^{1,1}(1)\, a^1(t-1) + b^1$$

$$a^1(t) = f^1\big(n^1(t)\big)$$

Figure 4. Phase Detection Network for Power Systems

B. Speech Prediction

Predictive coding of speech signals is commonly used for data compression. The standard method has used Linear Predictive Coding (LPC). Neural networks allow the use of nonlinear predictive coding. Fig. 5 shows a pipeline recurrent neural network [21], which can be used for speech prediction. The target output would be the next value of the input sequence.

Figure 5. Speech Prediction Network

C. Channel Equalization

The performance of a communication system can be seriously impaired by channel effects and noise. These may cause the transmitted signal of one symbol to spread out and overlap successive symbol intervals - commonly termed Intersymbol Interference. Dynamic neural networks, like the one in Fig. 4, can be used to perform channel equalization, to compensate for the effects of the channel [3]. Fig. 6 shows the block diagram of such a system.

Figure 6. Channel Equalization System

D. Model Reference Control

Dynamic networks are suitable for many types of control systems. Fig. 7 shows the architecture of a model reference control system [14].

Figure 7. Model Reference Control System


E. Grammatical Inference

Grammars are a way to define languages. They consist of rules that describe how to construct valid strings. Dynamic neural networks can be trained to recognize which strings belong to a language and which don't. Dynamic networks can also perform grammatical inference - learning a grammar from example strings. Fig. 8 shows a dynamic network that can be used for grammatical inference [8]. The error function is defined by a single output neuron. At the end of each string presentation it should be 1 if the string is valid and 0 if not.

Figure 8. Grammar Inference Network

F. Protein Folding

Each gene within the DNA molecule codes for a protein. The amino acid sequence (A,T,G,C) determines the protein structure (e.g., secondary structure = helix, strand, coil). However, the relationship between the sequence and the structure is very complex. In the network in Fig. 9, the sequence is provided at the input to the network, and the output of the network indicates the secondary structure [9].

Figure 9. Protein Structure Identification Network

IV. GRADIENT CALCULATION FOR THE GLDDN

Dynamic networks are generally trained with a gradient or Jacobian-based algorithm. In this section we describe an algorithm for computing the gradient for the GLDDN. This can be done using the BPTT or the RTRL approaches. Because of limited space, we will describe only the RTRL algorithm in this paper. (Both approaches are described for the LDDN framework in [12].) To explain the gradient calculation for the GLDDN, we must create certain definitions. We do that in the following paragraphs.

A. Preliminary Definitions

First, as we stated earlier, a layer consists of a set of weights, associated weight functions, associated tapped delay lines, a net input function, and a transfer function. The network has inputs that are connected to special weights, called input weights. The weights connecting one layer to another are called layer weights. In order to calculate the network response in stages, layer by layer, we need to proceed in the proper layer order, so that the necessary inputs at each layer will be available. This ordering of layers is called the simulation order. In order to backpropagate the derivatives for the gradient calculations, we must proceed in the opposite order, which is called the backpropagation order. In order to simplify the description of the gradient calculation, some layers of the GLDDN will be assigned as network outputs, and some will be assigned as network inputs. A layer is an input layer if it has an input weight, or if it contains any delays with any of its weight matrices. A layer is an output layer if its output will be compared to a target during training, or if it is connected to an input layer through a matrix that has any delays associated with it.

For example, the LDDN shown in Fig. 2 has two output layers (1 and 3) and two input layers (1 and 2). For this network the simulation order is 1-2-3, and the backpropagation order is 3-2-1. As an aid in later derivations, we will define U as the set of all output layer numbers and X as the set of all input layer numbers. For the LDDN in Fig. 2, U = {1,3} and X = {1,2}.

B. Gradient Calculation

The objective of training is to optimize the network performance, quantified in the performance index $F(\mathbf{x})$, where $\mathbf{x}$ is a vector containing all of the weights and biases in the network. In this paper we will consider gradient-based algorithms for optimizing the performance (e.g., steepest descent, conjugate gradient, quasi-Newton, etc.). For the RTRL approach, the gradient is computed using

$$\frac{\partial F}{\partial \mathbf{x}^T} = \sum_{t} \sum_{u \in U} \frac{\partial^e F}{\partial a^u(t)^T}\,\frac{\partial a^u(t)}{\partial \mathbf{x}^T} \qquad (7)$$

where

$$\frac{\partial a^u(t)}{\partial \mathbf{x}^T} = \frac{\partial^e a^u(t)}{\partial \mathbf{x}^T} + \sum_{x \in X}\ \sum_{u' \in U}\ \sum_{d \in DL_{x,u'}} \frac{\partial^e a^u(t)}{\partial n^x(t)^T}\,\frac{\partial^e n^x(t)}{\partial a^{u'}(t-d)^T}\,\frac{\partial a^{u'}(t-d)}{\partial \mathbf{x}^T} \qquad (8)$$

The superscript e in these expressions indicates an explicit derivative, not accounting for indirect effects through time. Many of the terms in Eq. 8 will be zero and need not be included. To take advantage of these efficiencies, we introduce the following definitions

$$E^U_{LW}(x) = \big\{\, u \in U \mid LW^{x,u} \neq 0 \,\big\} \qquad (9)$$

$$E^X_S(u) = \big\{\, x \in X \mid S^{u,x} \neq 0 \,\big\} \qquad (10)$$

where

$$S^{u,m}(t) \equiv \frac{\partial^e a^u(t)}{\partial n^m(t)^T} \qquad (11)$$

is the sensitivity matrix. Using these definitions, we can rewrite Eq. 8 as

$$\frac{\partial a^u(t)}{\partial \mathbf{x}^T} = \frac{\partial^e a^u(t)}{\partial \mathbf{x}^T} + \sum_{x \in E^X_S(u)} S^{u,x}(t) \sum_{u' \in E^U_{LW}(x)}\ \sum_{d \in DL_{x,u'}} \frac{\partial^e n^x(t)}{\partial a^{u'}(t-d)^T}\,\frac{\partial a^{u'}(t-d)}{\partial \mathbf{x}^T} \qquad (12)$$

The sensitivity matrix can be computed using static backpropagation, since it describes derivatives through a static portion of the network. The static backpropagation equation is

$$S^{u,m}(t) = \left[\, \sum_{l \in E_S(u)\,\cap\, L^b_m} S^{u,l}(t)\,\frac{\partial^e n^l(t)}{\partial\, lz^{l,m}(t,0)^T}\,\frac{\partial^e lz^{l,m}(t,0)}{\partial a^m(t)^T} \right] \dot{F}^m\big(n^m(t)\big), \quad u \in U \qquad (13)$$

where m is decremented from u through the backpropagation order, $L^b_m$ is the set of indices of layers that are directly connected backwards to layer m (or to which layer m connects forward) and that contain no delays in the connection, and

$$\dot{F}^m\big(n^m(t)\big) = \frac{\partial^e a^m(t)}{\partial n^m(t)^T} \qquad (14)$$

There are four terms in Eqs. 12 and 13 that need to be computed:

$$\frac{\partial^e n^x(t)}{\partial a^{u'}(t-d)^T},\quad \frac{\partial^e n^l(t)}{\partial\, lz^{l,m}(t,0)^T},\quad \frac{\partial^e lz^{l,m}(t,0)}{\partial a^m(t)^T},\ \text{ and }\ \frac{\partial^e a^u(t)}{\partial \mathbf{x}^T} \qquad (15)$$

The first term can be expanded as follows:

$$\frac{\partial^e n^x(t)}{\partial a^{u'}(t-d)^T} = \frac{\partial^e n^x(t)}{\partial\, lz^{x,u'}(t,d)^T}\,\frac{\partial^e lz^{x,u'}(t,d)}{\partial a^{u'}(t-d)^T} \qquad (16)$$

The first term on the right of Eq. 16 is the derivative of the net input function, which is the identity matrix if the net input is the standard summation. The second term is the derivative of the weight function, which is the corresponding weight matrix if the weight function is the standard dot product. Therefore, the right side of Eq. 16 becomes simply a weight matrix for LDDN networks. The second term in Eq. 15 is the same as the first term on the right of Eq. 16. It is the derivative of the net input function. The third term in Eq. 15 is the same as the second term on the right of Eq. 16. It is the derivative of the weight function. The final term that we need to compute is the last term in Eq. 15, which is the explicit derivative of the network outputs with respect to the weights and biases in the network. One element of that matrix can be written

$$\frac{\partial^e a^u_k(t)}{\partial\, iw^{m,l}_{i,j}(d)} = \sum_{r} s^{u,m}_{k,r}(t)\,\frac{\partial^e n^m_r(t)}{\partial\, iz^{m,l}_r(t,d)}\,\frac{\partial^e iz^{m,l}_r(t,d)}{\partial\, iw^{m,l}_{i,j}(d)} \qquad (17)$$

The first term in this summation is an element of the sensitivity matrix, which is computed using Eq. 13. The second term is the derivative of the net input, and the third term is the derivative of the weight function. (We have made the assumption here that the net input function operates on each element individually.) Eq. 17 is the equation for an input weight. Layer weights and biases would have similar equations. This completes the RTRL algorithm for networks that can be represented in the GLDDN framework. The main steps of the algorithm are Eqs. 7 and 12, where the components of Eq. 12 are computed using Eqs. 16 and 17. Computer code can be written from these equations, with modules for weight functions, net input functions and transfer functions added as needed. Each module should define the function response, as well as its derivative. The overall framework is independent of the particular form of these modules.
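As a concrete illustration of the RTRL recursion in Eqs. 7 and 12, the sketch below accumulates the gradient for the simplest possible case, a single linear recurrent neuron a(t) = w1 p(t) + w2 a(t-1) (the same network analyzed in Section V). This is a simplified special case, not the general GLDDN code: with one layer and a single unit delay the sensitivity of Eq. 13 becomes trivial, and the propagated term of Eq. 12 reduces to a single product. All names and the finite-difference check are assumptions made for the example.

```python
import numpy as np

def rtrl_gradient(p, targets, w1, w2, a0=0.0):
    """RTRL gradient of F = sum of squared errors for a(t) = w1*p(t) + w2*a(t-1)."""
    a = a0
    da_dw1 = 0.0          # d a(t-1) / d w1, carried forward through time
    da_dw2 = 0.0          # d a(t-1) / d w2
    g1 = g2 = 0.0
    for p_t, target_t in zip(p, targets):
        a_prev = a
        a = w1 * p_t + w2 * a_prev                 # network response
        # Eq. 12 in scalar form: total derivative = explicit part + propagated part
        da_dw1 = p_t + w2 * da_dw1
        da_dw2 = a_prev + w2 * da_dw2
        # Eq. 7: accumulate the gradient of the performance index
        e = a - target_t
        g1 += 2.0 * e * da_dw1
        g2 += 2.0 * e * da_dw2
    return g1, g2

# sanity check against a central finite-difference approximation
p = np.random.default_rng(1).normal(size=50)
targets = np.zeros(50)
g1, g2 = rtrl_gradient(p, targets, 0.3, 0.4)

def sse(w1, w2):
    a, s = 0.0, 0.0
    for p_t, t_t in zip(p, targets):
        a = w1 * p_t + w2 * a
        s += (a - t_t) ** 2
    return s

eps = 1e-6
print(g1, (sse(0.3 + eps, 0.4) - sse(0.3 - eps, 0.4)) / (2 * eps))
print(g2, (sse(0.3, 0.4 + eps) - sse(0.3, 0.4 - eps)) / (2 * eps))
```

Because the forward pass and the derivative recursion are advanced together at each time step, this style of calculation can run on-line, which is the property of RTRL noted in the introduction.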


V. TRAINING DIFFICULTIES FOR DYNAMIC NETWORKS

From the previous section on dynamic network applications, it is clear that these types of networks are very powerful and have many uses. However, they have not yet been adopted comprehensively. The main reason for this is the difficulty in training these types of networks. The reasons for these difficulties are not completely understood, but it has been shown that one of the reasons is the existence of spurious valleys in the error surfaces of these networks. In this section, we will provide a quick overview of the causes of these spurious valleys and suggestions for mitigating their effects.

Fig. 10 shows an example of spurious valleys in the error surface of a neural network model reference controller (as shown in Fig. 7). In this particular example, the network had 65 weights. The plot shows the error surface along the direction of search in a particular iteration of a quasi-Newton optimization algorithm. It is clear from this profile that any standard line search, using a combination of interpolation and sectioning, will have great difficulty in locating the minimum along the search direction. There are many local minima contained in very narrow valleys. In addition, the bottoms of the valleys are often cusps. Even if our line search were to locate the minimum, it is not clear that the minimum represents an optimal weight location. In fact, in the remainder of this section, we will demonstrate that spurious minima are introduced into the error surface due to characteristics of the input sequence.

Figure 10. Example of Spurious Valleys (sum squared error vs. distance along the search direction)

In order to understand the spurious valleys in the error surfaces of dynamic networks it is best to start with the simplest network for which such valleys will appear. We have found that these valleys even appear in a linear network with one neuron, as shown in Fig. 11.

Figure 11. Single Neuron Recurrent Network

In order to generate an error surface, we first develop training data using the network of Fig. 11, where both weights are set to 0.5. We use a Gaussian white noise input sequence with mean zero and variance one for p(t), and then use the network to generate a sequence of outputs. In Fig. 12 we see a typical error surface, as the two weights are varied. Although this network architecture is simple, the error surfaces generated by these networks have spurious valleys similar to those encountered in more complicated networks. The two valleys in the error surface occur for two different reasons. One valley occurs along the line w1 = 0. If this weight is zero, and the initial condition is zero, the output of the network will remain zero. Therefore, our mean squared error will be constant and equal to the mean square value of the target outputs.

Figure 12. Single Neuron Network Error Surface

To understand where the second valley comes from, consider the network response equation:

$$a(t+1) = w_1\, p(t) + w_2\, a(t)$$

If we iterate this equation from the initial condition a(0), we get

$$a(t) = w_1 \big\{\, p(t) + w_2\, p(t-1) + (w_2)^2\, p(t-2) + \cdots + (w_2)^{t-1}\, p(1) \,\big\} + (w_2)^t\, a(0)$$

Here we can see that the response at time t is a polynomial in the parameter w2. (It will be a polynomial of degree t-1, if the initial condition is zero.) The coefficients of the polynomial involve the input sequence and the initial condition. We obtain the second valley because this polynomial contains a root outside the unit circle. There is some value of w2 that is larger than 1 in magnitude for which the output is almost zero. Of course, having a single output close to zero would not produce a valley in the error surface. However, we discovered that once the polynomial shown above has a root outside the unit circle at time t, that same root also appears in the next polynomial at time t+1, and therefore, the output will remain small for all future times for the same weight value. Fig. 13 shows a cross section of the error surface presented in Fig. 12 for w1 = 0.5 using different sequence lengths. The error falls abruptly near w2 = -3.8239. That is the root of the polynomial described above. The root maintains its location as the sequence increases in length. This causes the valley in the error surface.

Figure 13. Cross sections of the error surface of Fig. 12 at w1 = 0.5 (log sum squared error vs. w2)

We have since studied more complex networks, with nonlinear transfer functions and multiple layers. The number of spurious valleys increases in these cases, and they become more complex. However the causes of the valleys remain similar. They are affected by initial conditions and roots of the input sequence (or subsequence).
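The single-neuron experiment above is straightforward to reproduce. The sketch below, a minimal illustration rather than the authors' code, generates training data from the network of Fig. 11 with both weights equal to 0.5, evaluates the sum squared error over a grid of (w1, w2), and lists any real roots of the input polynomial that lie outside the unit circle, which is where the discussion above predicts a spurious valley can appear. The random seed, the short sequence length (kept small so the grid evaluation stays cheap and does not overflow), and the grid ranges are assumptions made for the example.

```python
import numpy as np

def simulate(p, w1, w2, a0=0.0):
    """Response of the single-neuron network a(t) = w1*p(t) + w2*a(t-1)."""
    a, out = a0, []
    for p_t in p:
        a = w1 * p_t + w2 * a
        out.append(a)
    return np.array(out)

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, size=30)          # Gaussian white noise input (mean 0, variance 1)
targets = simulate(p, 0.5, 0.5)            # training data generated with w1 = w2 = 0.5

# sum squared error over a grid of the two weights (plot to obtain a surface like Fig. 12)
w1_grid = np.linspace(-2.0, 2.0, 81)
w2_grid = np.linspace(-5.0, 5.0, 201)
sse = np.array([[np.sum((simulate(p, w1, w2) - targets) ** 2) for w2 in w2_grid]
                for w1 in w1_grid])

# the polynomial p(t) + w2*p(t-1) + ... + w2^(t-1)*p(1): np.roots expects coefficients
# from the highest power of w2 (earliest sample) down to the constant term (latest sample)
roots = np.roots(p)
real_roots = roots[np.abs(roots.imag) < 1e-9].real
print("real roots outside the unit circle:", real_roots[np.abs(real_roots) > 1.0])
```

Plotting sse over the grid should show the valley along w1 = 0 described above, and, when the drawn sequence happens to have a real root outside the unit circle, a narrow valley at that value of w2.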

This leads to several procedures for improving the training for these networks. The first training modification is to switch training sequences often during training. If training is becoming trapped in a spurious valley, the valley will move when the training sequence is changed. Also, since some of the valleys are affected by the choice of initial condition, a second modification is to use small random initial conditions for neuron outputs and change them periodically during training. A further modification is to use a regularized performance index to force weights into the stable region. Since the deep valleys occur in regions where the network is unstable, we can avoid the valleys by maintaining a stable network. We generally decay the regularization factor during training, so that the final weights will not be biased.

VI. CONCLUSIONS

Dynamic neural networks represent a very powerful paradigm, and, as we have shown in this paper, they have a very wide variety of applications. However, they have not been as widely implemented as their power would suggest. The reason for this discrepancy is related to the difficulties in training these networks.

The first obstacle in dynamic network training is the calculation of training gradients. In most cases, the gradient algorithm is custom designed for a specific network architecture, based on the general concepts of BPTT or RTRL. This creates a barrier to using dynamic networks. We propose a general dynamic network framework, the GLDDN, which encompasses almost all dynamic networks that have been proposed. This enables us to have a single code to calculate gradients for arbitrary networks, and reduces the initial barrier to using dynamic networks.

The second obstacle to dynamic network training relates to the complexities of their error surfaces. We have described some of the mechanisms that cause these complexities - spurious valleys. We have also shown how to modify training algorithms to avoid these spurious valleys. We hope that these new developments will encourage the increased adoption of dynamic neural networks.
REFERENCES

[1] Hagan, M., Demuth, H., De Jesús, O., "An Introduction to the Use of Neural Networks in Control Systems," invited paper, International Journal of Robust and Nonlinear Control, Vol. 12, No. 11 (2002), pp. 959-985.
[2] Roman, J. and Jameel, A., "Backpropagation and recurrent neural networks in financial analysis of multiple stock market returns," Proceedings of the Twenty-Ninth Hawaii International Conference on System Sciences, vol. 2 (1996), pp. 454-460.
[3] Feng, J., Tse, C.K., Lau, F.C.M., "A neural-network-based channel-equalization strategy for chaos-based communication systems," IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 50, no. 7 (2003), pp. 954-957.
[4] Kamwa, I., Grondin, R., Sood, V.K., Gagnon, C., Nguyen, V.T., Mereb, J., "Recurrent neural networks for phasor detection and adaptive identification in power system control and protection," IEEE Transactions on Instrumentation and Measurement, vol. 45, no. 2 (1996), pp. 657-664.
[5] Jayadeva and Rahman, S.A., "A neural network with O(N) neurons for ranking N numbers in O(1/N) time," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 10 (2004), pp. 2044-2051.
[6] Chengyu, G. and Danai, K., "Fault diagnosis of the IFAC Benchmark Problem with a model-based recurrent neural network," Proceedings of the 1999 IEEE International Conference on Control Applications, vol. 2 (1999), pp. 1755-1760.
[7] Robinson, A.J., "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2 (1994).
[8] Medsker, L.R. and Jain, L.C., Recurrent Neural Networks: Design and Applications, Boca Raton, FL: CRC Press (2000).
[9] Gianluca, P., Przybylski, D., Rost, B., Baldi, P., "Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles," Proteins: Structure, Function, and Genetics, vol. 47, no. 2 (2002), pp. 228-235.
[10] Werbos, P.J., "Backpropagation through time: What it is and how to do it," Proceedings of the IEEE, vol. 78 (1990), pp. 1550-1560.
[11] Williams, R.J. and Zipser, D., "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1 (1989), pp. 270-280.
[12] De Jesús, O. and Hagan, M., "Backpropagation Algorithms for a Broad Class of Dynamic Networks," IEEE Transactions on Neural Networks, Vol. 18, No. 1 (2007), pp. 14-27.
[13] De Jesús, O., Training General Dynamic Neural Networks, Doctoral Dissertation, Oklahoma State University, Stillwater, OK (2002).
[14] Narendra, K.S. and Parthasarathy, K., "Identification and control of dynamical systems using neural networks," IEEE Transactions on Neural Networks, Vol. 1, No. 1 (1990), pp. 4-27.
[15] Wan, E. and Beaufays, F., "Diagrammatic Methods for Deriving and Relating Temporal Neural Networks Algorithms," in Adaptive Processing of Sequences and Data Structures, Lecture Notes in Artificial Intelligence, Gori, M. and Giles, C.L., eds., Springer Verlag (1998).
[16] Dreyfus, G., Idan, Y., "The Canonical Form of Nonlinear Discrete-Time Models," Neural Computation, vol. 10 (1998), pp. 133-164.
[17] Tsoi, A.C., Back, A., "Discrete time recurrent neural network architectures: A unifying review," Neurocomputing, vol. 15 (1997), pp. 183-223.
[18] Personnaz, L., Dreyfus, G., "Comment on 'Discrete-time recurrent neural network architectures: A unifying review'," Neurocomputing, vol. 20 (1998), pp. 325-331.
[19] Feldkamp, L.A. and Puskorius, G.V., "A signal processing framework based on dynamic neural networks with application to problems in adaptation, filtering, and classification," Proceedings of the IEEE, vol. 86, no. 11 (1998), pp. 2259-2277.
[20] Campolucci, P., Marchegiani, A., Uncini, A., and Piazza, F., "Signal-Flow-Graph Derivation of On-line Gradient Learning Algorithms," Proceedings of the International Conference on Neural Networks ICNN'97 (1997), pp. 1884-1889.
[21] Haykin, S. and Li, L., "Nonlinear adaptive prediction of nonstationary signals," IEEE Transactions on Signal Processing, vol. 43, no. 2 (1995), pp. 526-535.


Improving VPN Security Performance Based on One-Time Password Technique Using Quantum Keys
Montida Pattaranantakul, Paramin Sangwongngam and Keattisak Sripimanwat
Optical and Quantum Communications Laboratory
National Electronics and Computer Technology Center, Pathumthani, Thailand
montida.pattaranantakul@nectec.or.th, paramin.sangwongngam@nectec.or.th, keattisak.sripimanwat@nectec.or.th

Abstract- Network encryption technology has become an essential factor for organizational security. Virtual Private Network (VPN) encryption is the most popular technique used to prevent unauthorized access to a private network. This technique normally relies on mathematical functions to generate periodic keys. As a result, security performance may degrade and the system may become vulnerable if high-performance computing makes it feasible to reverse the mathematical calculation and discover the next secret key pattern. The main contribution of this paper is to improve VPN performance by adopting quantum keys as a seed value in a one-time password technique that encompasses the whole process of authentication, data confidentiality, and security key management, in order to protect against eavesdroppers during data transmission over an insecure network.

Keywords- Quantum Keys, One-Time Password, Virtual Private Network

I. INTRODUCTION

Information technologies have been evolving rapidly to meet today's human communication needs, and the security of data transmission has always been a concern when transferring information from sender to receiver over an internet channel. Addressing network security issues is the main priority in protecting against unauthorized users, since the security technique should also cover data integrity, confidentiality, authorization, and non-repudiation services. A lack of adequate knowledge and understanding of software architecture and security engineering leads to security vulnerabilities: eavesdroppers might be able to gain information by monitoring the transmission for communication patterns, by capturing data packets during transmission over the internet, or by accessing information in private data storage, which may lead to data loss and data corruption. This is a critical factor that causes new threats to arise and may force business objectives to change. In the worst case, it affects organizational stability and business opportunities, and may even become a national security threat. For this reason, many organizations have to find ways to protect their information from eavesdroppers, based on security technology solutions that are agile enough to adapt to and combat existing threats arising from security breaches.

Therefore, data reliability and security protection are primary concerns for information exchange over unprotected network connections: users must be verified, since only an authorized user should be able to enter the system and govern resource access, while encryption technology is also required for further data protection. Presently, several types of cryptography [1] are used to achieve comprehensive data protection based on proven standard technology, since cryptography is the most important aspect of network and communication security and provides a basic building block for computer security. End-to-end encryption typically relies on the application layer closest to the end user, so only the data is encrypted. Network-layer encryption, where IPsec comes into play, covers confidentiality by encapsulating the security payload in both transport mode and tunnel mode; with this type of encryption the entire IP packet, including headers and payload, is encrypted. IPsec encryption based on Virtual Private Network technology [2] presents an alternative approach to network encryption, since it provides a trusted collaboration framework in which parties can communicate with each other over a private network. Nevertheless, the user authentication mechanism, cryptographic algorithms, key exchange procedure, and traffic selector information need to be configured and maintained at the two endpoints in order to establish a trusted VPN tunnel before data transmission begins. The widespread use of classical VPNs can improve the data transfer rate with high throughput, low delay, and few bottlenecks, because every communication route is built as the shortest-path connection with independent IPsec to improve elastic traffic performance. In contrast, the key exchange procedure during VPN setup is still a major point of vulnerability if either a secret key is intercepted or the key pattern is broken. In addition, most of the random numbers used as secret keys in cryptographic algorithms are derived from mathematical functions. This manner of key generation is one of the potential security vulnerabilities for data communications as high-performance computing makes it feasible to reverse the mathematical calculation and recover the secret key value. A one-time password mechanism [3] using quantum keys as a seed value for a hash function can solve this traditional VPN security problem, since it can eliminate the


spoofing attack in which an eavesdropper successfully masquerades as another party by falsifying data and thereby gains an illegitimate advantage. The main contribution of this paper is to improve VPN security performance by adopting a one-time password technique, using quantum keys as a seed value, to generate a fresh symmetric key each time a VPN tunnel is established. Thus, the two endpoints authenticate themselves in a secure manner that relies on confidential protection. Quantum keys are proposed to avoid repeating the same password, because traditional password creation derived from mathematical calculation may lead to system vulnerabilities. Quantum keys bring a strong security enhancement to password generation, because quantum key distribution (QKD) [4] promises to revolutionize secure communication by providing security based on the fundamental laws of physics [5], instead of on the current state of mathematical algorithms or computing technology [6]. This paper is organized as follows. Section II gives an overview of the VPN architecture and mechanism on which the design processes, technical solution, and implementation approach are based. Section III gives a detailed view of the design of a VPN security architecture for VPN tunnel establishment, since all information is transferred through the corresponding tunnel under authorization control. Section IV discusses a comparison and analysis of an existing VPN security method and the proposed idea. Finally, some concluding remarks and future work are given in Section V.

II. VPN ARCHITECTURE AND MECHANISM

Figure 1. The scenario of VPN encryption techniques

management service has been concerned in order to handle a model of secure keys exchange protocol. III. DESIGNING A NEW VPN SECURITY ARCHITECTURE

Fortunately, there are several network encryption technologies that have been used to protect private information from eavesdroppers over insecure network. At the current VPN encryption technology has become an attractive choice and widely used for protection against network security attacks. VPN encryption mechanism normally process as Client/Server operation in order to establish a direct tunnel between source address and destination address while the virtual private network is built up. All data packet are consecutively passed over VPN tunnel. Due to the merit of VPN technology can reduce network cost consumption cause from physical leased lines, so that the users can exchange such private information with high data protection and trust. In addition, VPN architecture is encompassed based on authentication [7], confidentiality and key management functional areas. According to authentication service is typically used to control the users when entrance into the system, only authorized user able to do forward to encrypt a tunnel during process of VPN connection start up. As the result, the authentication header is inserted between the original IP header and the new IP header shown in figure 1. Next, confidentiality service provide message encryption to prevent eavesdropping by third parties. Finally, Key

Basically, VPN tunnel encryption can be classified into two main methods: public key encryption and symmetric key encryption. This paper addresses only symmetric key encryption combined with a one-time password mechanism, in which a one-time key is used while the VPN connection is up and is destroyed at disconnection. The one-time keys originate from quantum keys used as a seed value to a hash function [8][9]. The mechanism covers both the user authentication process and tunnel establishment in order to protect data integrity. The proposed VPN security architecture is therefore divided into three major modules.

A. User Registration Module

To improve the security of a VPN connection, the user registration module is required either on first entry or when the password has expired. This module is activated when new users enroll in the system to request a legitimate password. Figure 2 shows the user registration procedure; each individual step is explained below. The result of this phase is the password assigned to the corresponding user, which is then used in the user authentication and negotiation steps.

1) New user login / password expired: This case occurs for two reasons: a new user registering with the server asks for a legitimate password, or an existing password has expired because it exceeded its lifetime. The registration phase is then activated to regenerate a new password.

2) Request for the password: The user transfers his or her identity information, including official name, identification or passport number, date of birth, address and so on, to the server in order to request a legitimate password. The corresponding user information will be stored in the user rights database for further reference.


Figure 2. User registration phase

3) Generate a unique 10-digit code: The server generates a legitimate password by random selection from a Quantum Key Distribution (QKD) device.

4) Store username and password: The username and the legitimate password (quantum keys created by the QKD device) are fed into a basic hash function, so that only the username and the password hash value are stored in the user rights database; the server itself does not know the exact password value.

5) Transfer the legitimate password: The legitimate password is transferred back to the corresponding user over a trusted channel to avoid password attacks.

6) Treat it as confidential information: The legitimate password is then used to verify the user in the authentication phase, i.e. to decide whether the user is authorized to perform VPN establishment. (A minimal sketch of the hashed storage of step 4 is given below.)
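The following is a minimal illustrative sketch, not the authors' implementation, of step 4: hashing a QKD-derived 10-digit code together with the username and storing only the digest. The names register_user and USER_RIGHTS_DB, the stand-in random source, and the choice of SHA-256 are assumptions for illustration; the paper does not specify a particular hash function.

    import hashlib
    import secrets

    # Hypothetical in-memory stand-in for the "user rights database" of the paper.
    USER_RIGHTS_DB = {}

    def random_10_digit_code():
        # Stand-in for a code drawn from a QKD device; an ordinary CSPRNG is used here.
        return "".join(str(secrets.randbelow(10)) for _ in range(10))

    def register_user(username):
        password = random_10_digit_code()
        # Only the hash of username+password is stored, so the server never keeps
        # the exact password value, as described in step 4.
        digest = hashlib.sha256((username + password).encode()).hexdigest()
        USER_RIGHTS_DB[username] = digest
        return password  # returned to the user over a trusted channel (step 5)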

Figure 3. User authentication phase

B. User Authentication Mechanism Module

The user authentication procedure maintains a high level of security through one-site checking. Before a VPN tunnel is created, users must verify themselves with the server by logging into the system with the username and password registered during the registration phase. The stages of the process, shown in Figure 3, are similar to the S/Key authentication mechanism [10]; hence only an authorized user can go forward and create a secure VPN tunnel. The authentication procedure can be explained as follows (a minimal sketch of the verification step is given after this list).

1) Logging into the system: After the registration phase has finished, a user who wishes to create a secure VPN tunnel enters the authentication phase, submitting the username and password to the server so that it can decide whether the user is authorized to perform the task.

2) Receive username and password: When the service starts, the server waits for a user call. When the server receives an authentication request, the username and password are stored temporarily as the input of the hash function.

3) Password expiration checking: This function examines the password life cycle, because a password whose lifetime exceeds the permitted allowance may decrease security. The password expiration check was therefore introduced to guard against password attacks.

4) Alert that the password is expired: The expiration result is sent back to the corresponding user to indicate whether the password is still valid. An invalid password returns the user to the registration phase for re-enrollment; otherwise the procedure continues.

5) Computing the password hash value: The password hash value is computed from the username and password obtained in the previous steps.

6) Comparing with the existing password hash value: The calculated hash value is compared with the hash value stored in the user rights database.

7) Alert that the username and password are invalid: The comparison result is acknowledged back to the corresponding user, and only an authorized user may continue to VPN tunnel establishment.
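A minimal sketch, under the same assumptions as the registration sketch above, of the server-side verification of steps 5 and 6: recompute the hash of the submitted username and password and compare it with the stored digest. The use of hmac.compare_digest only illustrates a constant-time comparison; the paper does not prescribe it.

    import hashlib
    import hmac

    def authenticate(username, password, user_rights_db):
        # Step 5: recompute the hash from the submitted credentials.
        digest = hashlib.sha256((username + password).encode()).hexdigest()
        stored = user_rights_db.get(username)
        # Step 6: compare with the stored hash; a constant-time comparison avoids
        # leaking information through timing.
        return stored is not None and hmac.compare_digest(digest, stored)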

C. VPN Tunnel Establishment Based on the One-Time Password Mechanism

The proposed technique adopts two distinctive features: the one-time password mechanism and the promise of quantum key exchange. In the one-time password mechanism each password is used only once and is renewed whenever a new connection is established, eliminating attacks such as replay, spoofing or birthday attacks. In addition, a one-time password mechanism based on a hash chain has an elegant design and attractive properties that allow high security performance to be reached.


Figure 4. VPN tunnel establishment phase

Applying quantum keys as a seed value to a hash function further improves the efficiency and security of the system. Quantum technology uses the polarization property to protect the transmitted keys: the keys cannot be trapped by an eavesdropper without raising the key error rate above a certain threshold value. The VPN tunnel establishment based on the one-time password mechanism, with quantum keys as the seed input to the hash function, is illustrated in Figure 4. A hash chain value, called the response value, is computed from the password, the quantum seed and the sequence number in order to establish a highly secure VPN tunnel. The process starts in reverse from the Nth element of the hash chain, indicated by the sequence number, which identifies the current response value to be used; the value is destroyed once the VPN tunnel is disconnected. The procedure for creating the secure VPN tunnel is explained as follows (a minimal sketch of the response-value computation follows this list).

1) Manually copy the quantum key and sequence number: The first time the VPN tunnel is set up, a quantum key seed (QKS) generated by a Quantum Random Number Generator (QRNG) [11] and a sequence number (SN) indicating the hashing order are manually distributed to the user in a way that prevents them from being attacked. These values are inputs to the hash function. For further re-establishments of the VPN tunnel this step is not repeated until the sequence number reaches zero.

2) Generate the response value at the user site: The user combines the legitimate password acquired in the registration phase, the quantum key seed and the sequence number allocated by the server to each particular user as the inputs of the hash function; an intermediate response value is generated once the processing finishes.

3) Transfer the response value: The response value computed at the user site is transmitted to the server for identity comparison.

4) Generate the response value at the server site: The server generates its own response value from the relevant user information stored in the user rights database.

5) Compare the two response values: The two response values are compared with each other; a match is taken as the symmetric encryption key for VPN tunnel establishment. This procedure is attractive because it offers strong protection without performing a key exchange over the insecure network.

6) Establish the VPN tunnel at the server site: If the two response values match, the server proceeds to establish a tunnel; to create its side of the secure connection it assigns the user a virtual IP address from the local virtual subnet.

7) Establish the VPN tunnel at the user site: The procedure for setting up the VPN tunnel at the user site is similar to that at the server site, via the external network interface.

8) Site-to-site VPN tunnel: The user and the server use virtual network interfaces to maintain an encrypted virtual tunnel. All information, including the actual user data and the ultimate source and destination addresses, is carried as a payload with an authentication header, and the virtual IP address is inserted into the packet before transmission.
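The following is a minimal sketch of how a hash-chain response value of the kind described in steps 1-5 could be computed, in the spirit of S/Key [10]: the password and quantum seed are hashed SN times and both sides compare the results. The concatenation format and the use of SHA-256 are illustrative assumptions; the paper does not specify them.

    import hashlib
    import hmac

    def response_value(password, quantum_seed, sequence_number):
        # Seed the chain with the secret password and the quantum key seed (QKS).
        value = (password + quantum_seed).encode()
        # Walk the hash chain SN times; the sequence number selects the element used.
        for _ in range(sequence_number):
            value = hashlib.sha256(value).digest()
        return value

    def tunnel_key_if_match(user_response, server_response):
        # Step 5: a matching response value is used as the symmetric tunnel key.
        if hmac.compare_digest(user_response, server_response):
            return user_response
        return None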

IV. COMPARISON AND ANALYSIS

This section compares and analyses the qualities of a VPN connection under different encryption techniques, namely symmetric key encryption and public key encryption. The proposed system is distinguished from the classical techniques by the added features of the one-time password mechanism and quantum keys, which improve security performance.

A. Key Pattern Generator and Its Properties

Highly secure communications depend on two important factors: the quality of the random numbers used as cryptographic keys, and the complexity of the encryption algorithms. An algorithm produces a different output depending on the specific key in use at the time. Key generators can be built in two different ways. Pseudo-random number generators are algorithms that rely on mathematical functions, or simply on precalculated tables, to produce sequences that appear random but follow a periodic pattern; this technique may weaken key security, because it is feasible to predict the next generated value from the existing pattern.


TABLE I. PERFORMANCE COMPARISON OF DIFFERENT ENCRYPTION TECHNIQUES

Properties of VPN connections, by feature:

Key pattern generator
  Symmetric key encryption: periodic pattern based on mathematical functions
  Public key encryption: periodic pattern based on mathematical functions
  Proposed mechanism: random pattern based on quantum phenomena

Key properties
  Symmetric key encryption: pseudo-random numbers based on mathematical functions
  Public key encryption: pseudo-random numbers based on mathematical functions
  Proposed mechanism: true random numbers based on the laws of quantum physics

Security key protection and performance
  Symmetric key encryption: no key protection mechanism provided
  Public key encryption: no key protection mechanism provided
  Proposed mechanism: Quantum Bit Error Rate (QBER) ratio

Key exchange protocol
  Symmetric key encryption: secret key exchange over one classical communication link
  Public key encryption: public key exchange protocol over one classical communication link
  Proposed mechanism: secret key exchange over two communication links (quantum channel and classical channel)

Mechanism used
  Symmetric key encryption: secret key encryption
  Public key encryption: public and private key encryption
  Proposed mechanism: secret key encryption based on a one-time password mechanism using quantum keys

As a result, using pseudo-random numbers to produce keys for cryptographic systems carries real risk. The proposed technique instead applies quantum keys to both password generation and VPN tunnel encryption. Quantum keys come from a true random number generator based on quantum physics: subatomic particles behave randomly in circumstances that make it difficult to determine the key value. Key generation with an aperiodic pattern and a random distribution scheme increases the quality of the keys and, with it, data security.

B. Security Key Protection and Performance

One of the best characteristics of QKD technology is that it offers a promising, in principle unbreakable, way to secure communications: an eavesdropper attempting to intercept the quantum keys during the key exchange is detectable because it introduces an abnormal Quantum Bit Error Rate (QBER). The error rate may also exceed a certain threshold value because of unavoidable disturbances, including imperfect system configuration, noise on the quantum channel, or the rate of secret key generation over time. Hence the exploitation of quantum mechanics offers highly secure communications (see the sketch of the QBER check below).

C. Key Exchange Protocol

In general, QKD describes the process of using quantum communication to establish a shared secret key between two parties, similar to a secret key exchange architecture. The proposed technique takes the quantum keys acquired from the QKD system and distributes them to each corresponding user in a secure mode. When the server receives a password request, a partial key is dedicated to that user for identification, and some of the quantum keys are assigned as a seed value to the hash function in order to generate the particular secret key used to establish a VPN tunnel for secure data communication.
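A minimal sketch of the QBER check described in subsection B: compare the observed error fraction in a publicly compared key sample against a threshold and discard the key material if it is exceeded. The threshold of 0.11 is only a commonly cited ballpark for BB84-style protocols and is an assumption here, not a figure from the paper.

    def qber(sample_sent_bits, sample_received_bits):
        # Fraction of mismatched bits in the publicly compared sample.
        errors = sum(1 for s, r in zip(sample_sent_bits, sample_received_bits) if s != r)
        return errors / len(sample_sent_bits)

    def key_acceptable(sample_sent_bits, sample_received_bits, threshold=0.11):
        # If the error rate exceeds the threshold, assume eavesdropping or excessive
        # noise and discard the exchanged key material.
        return qber(sample_sent_bits, sample_received_bits) <= threshold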

D. Mechanism Used

VPN encryption technology provides client/server confidentiality. The proposed technique focuses on establishing a highly secure VPN tunnel by applying the one-time password mechanism to the quantum keys, the sequence number and the user's secret password in order to produce the response value used as a specific symmetric key. This symmetric key is used once, at a particular time, and is destroyed after the VPN tunnel is disconnected. In addition, a new symmetric key is produced periodically, following the properties of the one-way hash function, which enhances data confidentiality and network security protection.

V. CONCLUSIONS AND FUTURE WORK

Improving VPN security with a one-time password technique using quantum keys offers a new mechanism for protecting against data snooping by eavesdroppers when data is transmitted over an insecure network. The proposed technique comprises three main procedures, the user registration stage, the user authentication stage and the VPN tunnel establishment stage, designed to address various vulnerabilities and attacks. The user registration stage performs key generation to create the secret password for the corresponding user for later authentication. The secret password consists of true random numbers based on QKD technology, whose key exchange method protects against eavesdroppers, rather than pseudo-random numbers derived from mathematical functions, which leave room for brute force attacks if the password can be guessed. The user authentication stage provides legitimate users with transparent authentication while managing and monitoring access to private resources; password life cycle management and hash functions are also applied to close security vulnerabilities. Finally, the VPN tunnel establishment stage based on the one-time password mechanism is an attractive approach to building a highly secure virtual private network.


Each particular key is used only once and then destroyed. The proposed mechanism is also part of the project on a high-efficiency key management methodology for advanced communication services (a pilot study for a video conferencing system), covering the user authentication and VPN establishment phases in order to prevent unauthorized access to the restricted network and its resources before the quantum keys are distributed for secure video conferencing and other data communication services, since data protection and network security are main priorities for IT organizations.

ACKNOWLEDGMENT

The authors would like to thank Dr. Weetit Wanalertlak for invaluable feedback and technical support in bringing this research paper to completion. The authors also thank the NECTEC steering committee for research funding and for the valuable opportunity for the team to pursue new challenges in data protection and to improve data reliability over insecure networks. Finally, the authors thank Mr. Sakdinan Jantarachote and all the staff of the Optical and Quantum Communications Laboratory (OQC) for their kind support and encouragement.

REFERENCES
[1] William Stallings, Cryptography and Network Security: Principles and Practices, Fourth Edition, November 2005.
[2] Kazunori Ishimura, Toshihiko Tamura, Shiro Mizuno, Haruki Sato and Tomoharu Motono, "Dynamic IP-VPN architecture with secure IP tunnels," Information and Telecommunication Technologies, June 2010, pp. 1-5.
[3] Young Sil Lee, Hyo Taek Lim and Hoon Jae Lee, "A Study on Efficient OTP Generation using Stream with Random Digit," International Conference on Advanced Communication Technology 2010, volume 2, pp. 1670-1675.
[4] W. Heisenberg, "Über den anschaulichen Inhalt der quantentheoretischen Kinematik und Mechanik," Zeitschrift für Physik, 43, 1927, pp. 172-198.
[5] W.K. Wootters and W.H. Zurek, "A Single Quantum Cannot be Cloned," Nature, 299, pp. 802-803, 1982.
[6] Erica Klarreich, "Quantum Cryptography: Can You Keep a Secret," Nature, 418, pp. 270-272, July 18, 2002.
[7] Hyun Chul Kim, Hong Woo Lee, Kyung Seok Lee and Moon Seong Jun, "A Design of One-Time Password Mechanism using Public Key Infrastructure," Fourth International Conference on Network Computing and Advanced Information Management, September 2008, pp. 18-24.
[8] Harshvardhan Tiwari, "Cryptographic Hash Function: An Elevated View," European Journal of Scientific Research, ISSN 1450-216X, Vol. 43, No. 4 (2010), pp. 452-465.
[9] Peiyue Li, Yongxin Sui, Huaijiang Yang and Peiyue Li, "The Parallel Computation in One-Way Hash Function Designing," International Conference on Computer, Mechatronics, Control and Electronic Engineering, Aug 2010, pp. 189-192.
[10] C.J. Mitchell and L. Chen, "Comments on the S/KEY user authentication scheme," ACM Operating Systems Review, Vol. 30, No. 4, 1996, pp. 12-16.
[11] ID Quantique White Paper, "Random Number Generation Using Quantum Physics," Version 3.0, April 2010.


Experimental Results on the Reloading Wave Mechanism for Randomized Token Circulation
Boukary Ouedraogo
PRiSM - CARO, UVSQ 45, avenue des Etats-Unis F-78035 Versailles Cedex, France Email: boukary.ouedraogo@ens.uvsq.fr

Thibault Bernard
CRESTIC - Syscom, URCA, Moulin de la Housse BP-1039 F-51687, Reims cedex 2, France Email: thibault.bernard@univ-reims.fr

Alain Bui
PRiSM - CARO, UVSQ 45, avenue des Etats-Unis F-78035 Versailles Cedex, France Email: alain.bui@prism.uvsq.fr

Abstract—In this paper, we evaluate experimentally the gain of a distributed mechanism called the reloading wave in accelerating the recovery of a randomised token circulation algorithm. The experiments are carried out in different contexts: static networks and dynamic networks. The impact of different parameters such as connectivity or frequency of failures is investigated.

I. INTRODUCTION

Concurrency control is one of the most important requirements in distributed systems. The emergence of wireless mobile networks has renewed the challenge of designing concurrency control solutions: these networks require new models and new solutions that take their intrinsic dynamicity into account. In [Ray91], the author classifies concurrency control into two types: quorum-based solutions and token-circulation-based solutions. Numerous papers deal with token-circulation-based solutions because they are easier to implement: a single circulating token represents the privilege to access the shared resource (uniqueness of the token guarantees safety, and perpetual circulation among all nodes guarantees liveness). In the context of dynamic networks, random-walk-based solutions have been designed (see [Coo11]). The properties of random walks allow the design of a traversal scheme that uses only local information [AKL+79]: such a scheme is not designed for one particular topology and needs no adaptation to other ones. Moreover, random walks adapt to the insertion or deletion of nodes or links in the network without modifying any of the functioning rules. With the increasing dynamicity of networks, these features are becoming crucial: redesigning a browsing scheme at each modification of the topology is impossible. An important result of this paradigm is that the token eventually visits (with probability 1) all the nodes of the system. However, it is impossible to give an upper bound on the time required to visit all the nodes; only average quantities are available for the cover time, defined as the average time to visit all the nodes. The token circulation can suffer different kinds of failures: in particular, (i) situations with no token and (ii) situations with multiple tokens may occur. Both have to be managed to guarantee the liveness and safety properties of concurrency control solutions.

The concept of self-stabilization introduced in [Dij74] is the most general technique for designing a system that tolerates arbitrary transient faults. A self-stabilizing system is guaranteed to converge to a legitimate state in finite time no matter what initial state it starts from, and can therefore recover from transient faults automatically without any intervention. To design self-stabilizing token circulation, numerous authors build and maintain spanning structures such as trees or rings (cf. [CW05], [HV01]) and use the counter flushing mechanism ([Var00]) to guarantee the presence of a single token. In the case of a random-walk-based token circulation, counter flushing cannot be used. In [DSW06], the authors use randomly circulating tokens (which they call agents) to broadcast information in a communication group. To cope with the situation where no agent exists in the system, the authors use a timer based on the cover time of an agent (k · n^3). As a concluding remark they note that "the requirements will hold with higher probability if we enlarge the parameter k for ensuring the cover time [...]". For a concurrency control mechanism, obtaining a single token is a strong requirement, so a parameter k that merely increases the probability of reaching a legitimate configuration cannot be used. In [BBS11] we introduced the reloading wave mechanism, which ensures that a single token is obtained and hence the safety property of the concurrency control solution. In this paper we propose an experimental evaluation of this mechanism under different parameters: timeout initialization, connectivity of the network, dynamicity of the network and failure frequency. To test or validate a solution, the authors of [GJQ09] proposed four classes of methodologies: (i) in-situ, where one executes a real application (program, set of services, communications, etc.) on a real environment (set of machines, OS, middleware, etc.); (ii) emulation, where one executes a real application on a model of the environment; (iii) benchmarking, where one executes a model of an application on a real environment; and (iv) simulation, where one executes a model of an application on a model of the environment. To each of these methodologies corresponds a class of tools: real-scale environments, emulators, benchmarks and simulators.


In this paper, we adopt the simulation class of methodologies and use simulators, because simulation allows highly reproducible experiments over a large set of platforms and experimental conditions. Simulation tools support the creation of repeatable and controllable environments for feasibility studies and performance evaluation [GJQ09], [SYB04]. Simulation tools for parallel and distributed systems can be classified into three main categories. (i) Network simulation tools: Network Simulator NS-2 supports several levels of abstraction to simulate a wide range of network protocols over wired and wireless networks via numerous simulation interfaces; SimJava [HM98] provides a core set of foundation classes for discrete-event simulation of distributed hardware systems, communication protocols and computer architectures. (ii) Simulation tools for grids: GridSim [BM02] supports simulation of space-based and time-based, large-scale resources in a Grid environment; SimGrid [CLA+08] simulates single or multiple scheduling entities and time-shared systems operating in a Grid computing environment, targeting distributed Grid applications for resource scheduling; Dasor [Rab09] is a C++ library for discrete-event simulation of distributed algorithms (management of networks with topologies, failure models, mobility models, communication models, structures such as trees and matrices, etc.), based on a multi-layer model (Application, Grid Middleware, Network). (iii) Simulation tools for peer-to-peer networks: PeerSim [MJ09] supports extreme scalability and dynamicity and is composed of two simulation engines, a simplified (cycle-based) one and an event-driven one. In [GJQ09], the comparison made between different simulators for networking and large-scale distributed systems shows that any such tool provides very high control of the experimental conditions (limited only by tool scalability) and perfect reproducibility by design. The main differences between the tools are (i) the abstraction level (moderate for network simulators, high for grid ones and very high for P2P ones) and (ii) the achieved scale, which also varies greatly from tool to tool. In the second section we present the model of the token circulation algorithm that uses the reloading wave mechanism. In the third section, we propose an experimental evaluation of the reloading wave mechanism. Finally, we conclude the paper by presenting perspectives.

(ii) each component of the system is visited infinitely often by the token. The random walk token moving scheme ensures that the second part (ii) of the specification is satisfied (as long as no adversary plays with the mobility of the components against the random moves of the token). Starting from an arbitrary configuration, the first part of the specification can be violated by two situations: absence of a token and multiple tokens. To manage the absence of a token, each node sets up a timeout mechanism; when a timeout triggers, the node creates a new token, so the absence-of-token situation no longer occurs. The multiple-token situation is managed as in [IJ90]: when several tokens meet on a node, they are merged into one. Unfortunately, the combination of the two mechanisms does not guarantee the presence of exactly one token: if a subset of nodes is not visited by the token for a sufficiently long period, token creation can still occur even though a token already exists. The goal of the reloading wave mechanism is to prevent these unnecessary token creations. This prevention is realized by the token itself: it periodically propagates information meaning that it is still alive. The reloading wave uses several tools for its operation:
- A timeout mechanism: all nodes in the network run a timeout procedure, a timer whose value decrements at each clock tick. When a node's timer expires, the node creates a new token and sends it to one of its neighbors according to the random walk moving scheme. Note that several tokens can then circulate in the network.
- An adaptive spanning structure of the network topology stored in the token. The spanning structure is stored as a circulating word that represents a spanning tree; this tree is used to propagate the reloading wave. Every node that receives a reloading wave message resets its local timer and then propagates the message to all its sons according to the spanning tree maintained in the word of the token.
- A hop counter stored in the structure of the token: initialized to zero when the token is created, the hop counter is incremented at every step of the random walk and reset to zero each time the node holding the token triggers a reloading wave propagation.
The different phases of the reloading wave mechanism are the following:
1) Phase of reloading wave triggering: On reception of a token by a node, the word content and the hop counter of the token are updated. The reloading wave mechanism begins as soon as a node, on receiving a token, observes that the triggering condition is satisfied. The triggering condition is: the received token's hop counter (NbHop) equals the difference between the timer initialization value (Tmax) and the network size (N). In


other words, the reloading wave is triggered every (Tmax − N) steps of the token random walk. During this phase, the hop counter of the token is reset to zero. Reloading wave messages are created by nodes at the initiative of a token, more precisely of its hop counter (NbHop). Several reloading waves can be created (simultaneously or not) and propagated through the tree maintained in the token word.
2) Phase of reloading wave propagation: The propagation of the wave takes place along the adaptive tree contained in the word of the circulating token. Every node that receives a reloading wave message resets its local timeout and then propagates the wave to all its sons according to the adaptive tree maintained in the token word.
3) Phase of reloading wave termination: The reloading wave mechanism terminates when the reloading waves have reached all nodes of the virtual tree maintained in the token word, or when transient faults obstruct their diffusion.
The complete implementation of the mechanism can be found in [BBS11]. In the next section, we experiment with the reloading wave mechanism to evaluate its relevance.

III. EXPERIMENTAL RESULTS

Our simulation model is written in C++ using DASOR [Rab09], a C++ library for discrete-event simulation of distributed algorithms. The DASOR library provides many useful structures and tools that make it easy to write simulators. We investigate the reloading wave mechanism in three contexts: static networks (no node connection/disconnection, no failure), dynamic networks (node connection/disconnection, no failure) and networks subject to failure (node connection/disconnection, token creation/deletion).

A. Experimental protocol

For each parameter investigated, we measure the time elapsed in a satisfying configuration for two solutions:
1. A solution where a token circulates according to a random walk scheme. A timeout is initialized on each node so that new tokens are eventually created when it expires. The merger mechanism is triggered when several tokens are present on the same node.
2. The same protocol with the addition of the reloading wave mechanism described in the previous section.
A satisfying configuration is a configuration in which exactly one token is present in the system. For each set of parameters, we present the result as the difference between the time elapsed in a satisfying configuration with solution 2 and the time elapsed in a satisfying configuration with solution 1. We evaluate the impact of several important parameters:

- Size of the network
- Timeout initialization
- Mobility range of the nodes
- Frequency of failures
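Before turning to the experiments, the following minimal sketch (our illustration, not the authors' DASOR implementation) summarizes the node-side logic of the mechanism of Section II: timer decrement with token creation on expiry, the triggering condition NbHop = Tmax − N, and propagation of the reloading wave along the spanning tree carried by the token. The class and attribute names are assumptions made for the sketch.

    import random

    class Token:
        def __init__(self):
            self.hop_counter = 0   # NbHop, incremented at each random-walk step
            self.children = {}     # spanning tree from the circulating word: node id -> child nodes

    class Node:
        def __init__(self, node_id, t_max, network_size, neighbors):
            self.node_id = node_id
            self.t_max = t_max         # timer initialization value Tmax
            self.n = network_size      # network size N
            self.neighbors = neighbors
            self.timer = t_max

        def tick(self):
            # Timeout mechanism: when the timer expires, create a new token.
            self.timer -= 1
            if self.timer <= 0:
                self.timer = self.t_max
                return Token()         # injected into the network by the simulator
            return None

        def receive_token(self, token):
            # (Tokens meeting on the same node would be merged here.)
            token.hop_counter += 1
            if token.hop_counter == self.t_max - self.n:
                # Triggering condition NbHop = Tmax - N: start a reloading wave.
                token.hop_counter = 0
                self.receive_wave(token)
            # The enclosing simulator delivers the token to a random neighbour next.
            return random.choice(self.neighbors)

        def receive_wave(self, token):
            # Reset the local timer and propagate the wave to the sons in the token's tree.
            self.timer = self.t_max
            for child in token.children.get(self.node_id, []):
                child.receive_wave(token)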

B. Experimentation

Each experiment is repeated 100 times, and all the results reported are means over these runs; the standard deviation has been computed and is negligible.

1) Static networks, impact of size and of the timeout initialization: We set the timeout values as a function of the size of the network (n), taking 2n, 3n, 4n and 5n as timeout initializations. Intuitively, the greater the timeout, the smaller the difference between the solutions with and without the reloading wave, since token creation occurs on a timeout triggering (an unnecessary token creation compromises the satisfying configuration). On the other hand, the greater the size, the better the solution with the reloading wave works, since the mechanism avoids all unnecessary token creations, whereas the solution without the reloading wave has to ensure that all nodes are visited within one timeout period to avoid token creation, and the larger the network, the more difficult it is to visit all nodes with a random moving policy. The results are given in Table I in the form T1 − T2 = Δ, where T1 is the percentage of time elapsed in a satisfying configuration with the reloading wave solution, T2 the percentage without the reloading wave mechanism, and Δ the difference.
Table I. Difference between the solution with reloading wave and the solution without reloading wave for static networks (timeout initialization T = f(n))

Size n    2n            3n            4n            5n
50        99-22= 77%    99-46= 53%    99-67= 32%    99-82= 17%
100       99-16= 83%    99-38= 60%    99-58= 41%    99-74= 25%
200       99-11= 88%    99-32= 67%    99-50= 49%    99-66= 33%
300       99-10= 89%    99-27= 72%    99-46= 53%    99-62= 37%

Our intuition is verified: the reloading wave avoids all unnecessary token creations (the system is in a satisfying configuration 99% of the time; the remaining 1% corresponds to the initialization phase, where not enough data has been collected to propagate the reloading wave). The network size decreases the performance of the solution without the reloading wave, while the timeout improves it.

2) Dynamic networks, impact of dynamicity and failures: A dynamic network is subject to topological reconfigurations and failures; we investigate the impact of these two parameters on the behavior of the reloading wave mechanism. The two solutions (with and without the reloading wave) were run on a random graph of 300 nodes with a density of 60% (i.e. a link between two nodes exists with probability 0.6 at the initialization of the network). We used a mobility pattern where: (i) the movements of nodes are independent of each other; (ii) at any time


there is a fixed number of randomly chosen nodes that are disconnected; and (iii) the duration of the disconnection is set arbitrarily to 1 time unit. This model can be assimilated to the random walk mobility model (cf. [CBD02]). We set this parameter to obtain:
- A low mobility pattern: at a given time, 1% of nodes are disconnected. This value is reasonable for evaluating the performance of the algorithm under the conditions of a slowly moving network.
- An average mobility pattern: at a given time, 5% of nodes are disconnected, corresponding to a medium-speed moving network.
- A high mobility pattern: at a given time, 10% of nodes are disconnected, corresponding to a fast moving network.
In the same way, a failure model has been applied: all token messages have the same probability p of failing in every time interval t. We set p to 0.05%, since this seems a realistic value for message loss in a network, and t to:
- A low failure pattern: every 1000 turns, each token has a probability of 0.05% of being lost.
- An average failure pattern: every 100 turns, each token has a probability of 0.05% of being lost.
- A high failure pattern: every 10 turns, each token has a probability of 0.05% of being lost.
Results are given in Table II in the form T1 − T2 = Δ, where T1 is the percentage of time elapsed in a satisfying configuration with the reloading wave solution, T2 the percentage without the reloading wave mechanism, and Δ the difference.
Table II. Difference between the solution with reloading wave and the solution without reloading wave for dynamic networks

                 Token loss frequency
Mob. freq.       None          Low           Average      High
None             99-10= 89%    9-12= -3%     0-0= 0%      0-0= 0%
Low              34-31= 3%     30-29= 1%     26-26= 0%    25-25= 0%
Average          34-31= 3%     29-29= 0%     26-26= 0%    25-25= 0%
High             34-31= 3%     30-29= 1%     26-26= 0%    25-25= 0%

Frequent token loss greatly decreases the performance of both solutions (for the given token loss frequency parameters, between 30% and 25% of satisfying configurations). The gain of the reloading wave observed in the static context becomes marginal when tokens can be lost (less than 1% for low, average and high token loss frequencies). This is not surprising: the reloading wave mechanism relies on the persistence of the token, and as soon as the token can be lost, the spanning tree stored inside the token cannot be built, so several nodes can create new tokens. The impact of mobility is not the same: of course, a too frequent mobility pattern greatly decreases the performance of both solutions, but the reloading wave solution still has a marginal gain over the solution without the reloading wave when there is no token loss (about 3%). We think the reloading wave could be used for networks with a very slow mobility pattern: if the frequency of node movement is low, the spanning tree stored inside the token has enough time to be updated, and the reloading wave mechanism can work correctly.

IV. CONCLUSION

In this paper, we have investigated experimental results for the reloading wave, a mechanism to avoid unnecessary token creations, in static networks, dynamic networks and networks subject to failure. In a static environment, the reloading wave works perfectly (about 99% of satisfying configurations; the remaining 1% most likely corresponds to the initialization of the spanning structure on which the reloading wave is broadcast). The difference between the two solutions (with/without reloading wave) increases with the timeout value and decreases with the size of the network. The mobility of nodes has an impact on the functioning of the reloading wave: node mobility can break the spanning structure used to broadcast the wave. In [BBS11] we exhibited a mobility pattern for which the reloading wave works correctly; this mobility pattern has not been implemented in our experimentation, and we think the mobility used here is too strong to fit the criterion of that pattern. A new set of experiments on the mobility pattern is under investigation. The occurrence of failures also has an impact on the reloading wave mechanism: as the mechanism is initiated by the token, frequent token loss greatly decreases the performance of the reloading wave (about 25% of satisfying configurations with the given parameters). In most token circulation algorithms, token loss is considered an improbable event, and recovery has to be managed carefully; our solution is no exception to the rule, and a recovery takes a long time (depending on the timeout value) spent in non-satisfying configurations.

REFERENCES
[AKL+79] R. Aleliunas, R. Karp, R. Lipton, L. Lovász, and C. Rackoff. Random walks, universal traversal sequences and the complexity of maze problems. In 20th Annual Symposium on Foundations of Computer Science, pages 218-223, 1979.
[BBS11] Thibault Bernard, Alain Bui, and Devan Sohier. Universal adaptive self-stabilizing traversal scheme: random walk and reloading wave. CoRR, abs/1109.3561, 2011.
[BM02] Rajkumar Buyya and Manzur Murshed. GridSim: a toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing. Concurrency and Computation: Practice and Experience (CCPE), 14(13-15):1175-1220, December 2002.
[CBD02] T. Camp, J. Boleng, and V. Davies. A survey of mobility models for ad hoc network research. Wireless Communications and Mobile Computing (WCMC): Special Issue on Mobile Ad Hoc Networking: Research, Trends and Applications, 2(5):483-502, 2002.
[CLA+08] Henri Casanova, Arnaud Legrand, and Martin Quinson. SimGrid: a generic framework for large-scale distributed experiments. In 10th IEEE International Conference on Computer Modeling and Simulation, March 2008.




[Coo11] Colin Cooper. Random walks, interacting particles, dynamic networks: randomness can be helpful. In 18th International Colloquium on Structural Information and Communication Complexity, Gdansk, Poland, June 2011, volume 6796 of Lecture Notes in Computer Science, pages 1-14. Springer, 2011.
[CW05] Yu Chen and Jennifer L. Welch. Self-stabilizing dynamic mutual exclusion for mobile ad hoc networks. J. Parallel Distrib. Comput., 65(9):1072-1089, 2005.
[Dij74] Edsger W. Dijkstra. Self-stabilizing systems in spite of distributed control. Commun. ACM, 17(11):643-644, 1974.
[DSW06] S. Dolev, E. Schiller, and J. L. Welch. Random walk for self-stabilizing group communication in ad hoc networks. IEEE Trans. Mob. Comput., 5(7):893-905, 2006.
[GJQ09] Jens Gustedt, Emmanuel Jeannot, and Martin Quinson. Experimental methodologies for large-scale systems: a survey. Parallel Processing Letters, 19(3):399-418, 2009.
[HM98] Fred Howell and Ross McNab. SimJava: a discrete event simulation package for Java with applications in computer systems modelling. In Proceedings of the First International Conference on Web-based Modelling and Simulation, San Diego, CA, January 1998. Society for Computer Simulation.
[HV01] Rachid Hadid and Vincent Villain. A new efficient tool for the design of self-stabilizing l-exclusion algorithms: the controller. In Ajoy Kumar Datta and Ted Herman, editors, WSS, volume 2194 of Lecture Notes in Computer Science, pages 136-151. Springer, 2001.
[IJ90] Amos Israeli and Marc Jalfon. Token management schemes and random walks yield self-stabilizing mutual exclusion. In PODC, ACM, pages 119-131, 1990.
[MJ09] Alberto Montresor and Márk Jelasity. PeerSim: a scalable P2P simulator. In Proc. of the 9th Int. Conference on Peer-to-Peer (P2P'09), pages 99-100, Seattle, WA, September 2009.
[Rab09] C. Rabat. Dasor, a discrete events simulation library for grid and peer-to-peer simulators. Studia Informatica Universalis, 7(1), 2009.
[Ray91] Michel Raynal. A simple taxonomy for distributed mutual exclusion algorithms. Operating Systems Review, 25(2):47-50, 1991.
[SYB04] Anthony Sulistio, Chee Shin Yeo, and Rajkumar Buyya. A taxonomy of computer-based simulations and its mapping to parallel and distributed systems simulation tools. Softw. Pract. Exper., 34:653-673, June 2004.
[Var00] George Varghese. Self-stabilization by counter flushing. SIAM J. Comput., 30(2):486-510, 2000.


Statistical-based Car Following Model for Realistic Simulation of Wireless Vehicular Networks
Kitipong Tansriwong and Phongsak Keeratiwintakorn Department of Electrical Engineering, King Mongkuts University of Technology North Bangkok, Bangkok, THAILAND kit-ee@hotmail.com and phongsakk@kmutnb.ac.th
Abstract—At present, research on mobile and wireless vehicular networks has focused on the communication technologies. However, the behavioral study of vehicular networks is also important, and it can be costly. Simulation software has therefore been developed and used to study vehicle movement and the resulting variability of network performance, and the realism of such simulation software is an ongoing research topic. The major problem is that the mobility pattern or model of the vehicles under study is unrealistic, because of the complexity introduced by the variation of driver behavior, and a realistic model can be so costly that it is impractical for the study. In this research, we propose a realistic mobility model that integrates statistical analysis with the car following model. By using real data collection to create a mobility model based on a probability distribution, integrated with the well-known car following model, vehicular network simulation studies can be made more realistic. The results of this study show the opportunity to combine the proposed model with a network simulator such as NCTU-ns or ns-2. Keywords—vehicular network, realistic mobility model, car following model, statistical model

I. INTRODUCTION

Vehicles are part of our business logistics and everyday life, and the number of vehicles increases every year, but road capacity cannot be increased at the same rate. This causes many problems such as accidents, traffic jams, and economic loss in terms of transportation expenses. Another issue is the lack of real-time traffic data that could help mitigate such problems. Wireless vehicular technology has emerged with the goal of enabling communication between vehicles (V2V) or between vehicles and infrastructure (V2I). Such technology allows information from road devices such as detectors to be exchanged with the infrastructure or with vehicles directly, for faster response to events such as accidents or emergency rescues in the V2I case. In addition, data collected on site can be analyzed into traffic information that can be used in several ways, for applications in traffic engineering or for a traveler information center (TIC). In the V2V case, the traffic information can be broadcast or distributed over a group of vehicles to immediately inform drivers of such events.

V2V communication involves moving vehicles with a variety of movement patterns that can be uncertain, depending on many factors such as driver behavior, road conditions, and traffic conditions. Thus, the mobility model used to simulate vehicle communication scenarios for V2V communication can be erroneous when compared with the real system. This paper proposes a solution to reduce the error caused by theoretical mobility models of vehicle movement in simulation. In order to keep the mobility as close to reality as possible, we propose using statistical analysis techniques on collected data samples to form a statistical model that is integrated with the car following model. The car following model is a movement model proposed in transportation engineering that captures driver behavior on multi-lane road infrastructure. The outcome of our proposed model can be used in available simulation software such as NCTU-ns [1], ns-2 [2], or ONE [3].

II. RELATED WORK

Many studies have addressed mobility models of vehicles on the road that can be applied to V2V communication, from both the traffic engineering and the computer engineering perspective. It is therefore necessary to study the mobility models in their different forms as well as the different simulation software packages.

A. VANET mobility models

Several mobility models have been proposed for vehicular network (VANET) simulation studies, such as the Freeway model, the Manhattan model, and the City Section model. Freeway is a map-generation-based model as defined in [4]. This model is intended for highways or streets without traffic-light junctions. Vehicles in the traffic are forced to follow the vehicle in front of them; there is no overtaking or lane changing. The speed of a vehicle in this model is defined by its history-based speed and a random acceleration. When the distance to the vehicle ahead in the same lane falls below the value specified by the model, the acceleration becomes negative. The movement pattern of this model is not very realistic. An example of vehicle movement in the Freeway model is shown in Figure 1.


Figure 1: The vehicle movement in the Freeway model

Manhattan is also a map-generation-based model, introduced in [5] to simulate an urban environment. Each road has crossroads. It uses the same speed model as the Freeway model, but with more traffic lanes, and it allows lane changes at intersections. The direction of movement at a traffic intersection is chosen with a given probability, and a vehicle that is moving cannot be stopped. An example of vehicle movement in the Manhattan model is shown in Figure 2.

Figure 2: The vehicle movement in the Manhattan model

The City Section Mobility model [6] combines the principles of the Random Waypoint model and the Manhattan model. It adds pause times and random destination selection, uses a map based on the Manhattan model with a random number of vehicles on the road, and moves each vehicle to a given destination using the shortest-path algorithm. The speed of movement depends on the distance to the vehicle ahead and on the maximum speed of the road. The downside of this model is its use of a grid-like map that is not as complex as realistic road networks. An example of the City Section Mobility model is shown in Figure 3.

Figure 3: The pattern of the vehicle location in the City Section Mobility model

B. Simulation Software Package

Along with the mobility models come the simulation software packages for vehicular networks; several of them offer the use of the mobility models described in this section. MOVE [7] is a software package that generates a traffic trace based on a realistic mobility model for VANETs and works with the SUMO simulation software. SUMO [8] is a vehicular traffic simulator that includes traffic management features such as traffic light signals and traffic lane models for vehicular communication. SUMO can import a real map to analyze traffic and to generate a mobility model based on that traffic. MOVE is built in Java and can import real map file formats such as the TIGER/Line format or the Google Earth KML format, and it can specify the properties of each car such as speed and acceleration. MOVE interfaces with SUMO to generate a trace file to be used in a network simulator such as ns-2. Figure 4 shows a snapshot of the SUMO and MOVE software.

Figure 4: SUMO and MOVE

VanetMobiSim [9] is an extension of the CANU Mobility Simulation Environment, which is based on flexible frameworks for mobility modeling. CANU MobiSim is written in Java and can generate movement trace files in different formats, supporting different simulators and simulation tools for mobile networks. CANU MobiSim originally includes parsers for maps in the Geographical Data Files (GDF) format and provides implementations of several random mobility models as well as models from physics and vehicular dynamics. VanetMobiSim is designed for vehicular mobility modeling and features realistic automotive motion models at both the macroscopic and microscopic levels. At the macroscopic level, VanetMobiSim


can import maps from the TIGER Line database, or randomly generate them using Voronoi tessellation. VanetmobiSim can be support for multi-lane roads, separate directional flows, differentiated speed constraints and traffic signs at intersections. At the microscopic level, VanetMobiSim implements mobility models, providing realistic V2V and V2I interaction. According to these models, vehicles regulate their speed depending on nearby cars, overtake each other and act according to traffic signs in presence of intersections. Figure 5 shows snapshot of the the VanetMobiSim software.

method to change the driving behavior according to those localized parameters.
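To make the car-following idea concrete, here is a minimal sketch of a generic gap-based car-following update of the kind described above. It is an illustration only, not the specific model used by the authors; the parameters (desired gap, maximum acceleration and deceleration) and the update rule are assumptions.

    def follow_update(speed, lead_speed, gap, dt=1.0,
                      desired_gap=20.0, max_accel=1.5, max_decel=3.0):
        """One time step of a simple gap-based car-following rule (illustrative)."""
        if gap < desired_gap:
            # Too close to the front vehicle: decelerate, more strongly for small gaps.
            accel = -max_decel * (desired_gap - gap) / desired_gap
        else:
            # Enough headway: accelerate gently towards the leader's speed.
            accel = min(max_accel, (lead_speed - speed) / dt + 0.5)
        new_speed = max(0.0, speed + accel * dt)
        new_gap = gap + (lead_speed - new_speed) * dt
        return new_speed, new_gap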

Figure 6 The example of the car following model III. THE PROPOSED STATISTICAL-BASED CAR FOLLOWING MODEL The proposed design technique for a realistic mobility model for used in a simulation on wireless vehicular network is described in this section. First, the statistical model is present. Due to the lack of the integration of the localized parameter to be taken into account in the calculation of each vehicle speed and acceleration, we proposed a statistical model of each road that is created based on the collected data on that road as a representative for the calculation in the car following model. Figure 7 shows the block process of our proposed work for the integration with the car following model. Figure 5: The snapshot of the VanetMobiSim software C. The Car Following Model In the car following models, the behavior of each driver is described in relation to the vehicle ahead. With regard to each single car as an independent entity, the car following model falls into the category of the microscopic level mobility model. Each vehicles in the car following model computes its speed or its acceleration as a function of the factors such as the distance to the front car, the current speed of both vehicles, and the availability of the side lane. Figure 6 shows the example of the vehicle movement calculation of the car following model. Each vehicle is assigned its lane, i or j, and its location on the road segment. At each time slot, each vehicles speed, acceleration, and lane change probability is calculated based on the microscopic view of the road network and traffic. As a result, each vehicle may increase or decrease the speed and/or the acceleration. In addition, the vehicle may overtake or change the lane with the assign probability when the side lanes are available. The speed and the acceleration of the vehicle keep changing based on the conditions happening during the simulation such as car stop, congestion, or traffic stop light at a crossroad. The change is totally based on the random model without any integration of road structure properties such as the lane narrowness that can affect the speed of vehicles. In addition, as roads are connected in a network, they are different in types, structures and sometimes driving culture based on each country or city that can be classified as localized parameters. Therefore, the car following model is a realistic model that is suitable for vehicular network simulation, but it lacks of an adaptive Based on the variety of the vehicle speed data that is collected from different road types, we analyze the data and find a representative of such data set as a probability function. The probability function is used to generate vehicle speed data at a specific period based on the ID of the road on the map (Map ID). The outcome of the function, which is the speed, is used as an input to the car following model to calculate the speed, the acceleration and the lane change probability during such period. The variation of the length of the period is the tradeoff between the additional workload on the simulation and the realisticness of the model. Figure 8 shows the process of data collection on each specific road for our study. We use a Nokia mobile phone with our written software running on Symbian OS to collect data such as the current location of the vehicle and the current vehicle speed. The locating devices used in our data collection is the global positioning device (GPS) that is connected to the mobile phone via Bluetooth connection. 
Several research studies in the movement pattern of vehicles are to study the effect of the types of the roads by collecting real data for road traffic verification and validation. The road type that is currently in use can be divided into expressway or freeway road system, major or arterial road, collector road and local road.

Figure 7 The proposed statistical based car following model
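To make the integration in Figure 7 concrete, the following Python sketch samples a per-period speed from a road-specific distribution and feeds it into a very simple car-following update. The distribution parameters, the road IDs, and the gap-based update rule are illustrative assumptions, not the models fitted later in this paper.

import numpy as np
from scipy import stats

# Hypothetical per-road statistical models (Map ID -> fitted speed distribution, km/h).
# The parameters below are placeholders, not the values fitted in this paper.
ROAD_SPEED_MODELS = {
    "expressway_01": stats.weibull_min(c=3.0, scale=90.0),
    "collector_07": stats.gamma(a=2.0, scale=15.0),
}

def sample_desired_speed(map_id, rng):
    """Draw a desired speed (km/h) for the current period from the road's fitted model."""
    return float(ROAD_SPEED_MODELS[map_id].rvs(random_state=rng))

def car_following_step(speed, front_gap, desired_speed, dt=1.0,
                       max_accel=2.0, max_decel=4.0, safe_time_gap=1.5):
    """One illustrative update: accelerate toward the desired speed,
    but brake if the gap to the front car is shorter than the safe gap."""
    speed_ms = speed / 3.6
    if front_gap < safe_time_gap * speed_ms:
        accel = -max_decel                              # too close: decelerate
    else:
        accel = min(max_accel, (desired_speed - speed) / 3.6 / dt)
    new_speed_ms = max(0.0, speed_ms + accel * dt)
    return new_speed_ms * 3.6

rng = np.random.default_rng(1)
v = 60.0
for _ in range(5):
    target = sample_desired_speed("expressway_01", rng)
    v = car_following_step(v, front_gap=40.0, desired_speed=target)
    print(round(v, 1))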


Figure 8 The data collection process for road specific information

After we collect the data on the different road types, we use a statistical curve-fitting method on the collected data. The distribution function resulting from the curve fitting is then verified and validated. Once we have the distribution function, we implement it into the calculation of the car following model for the corresponding road type.

IV. MEASUREMENTS AND RESULTS
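As a rough illustration of this curve-fitting step, the sketch below fits Weibull and Gamma distributions to a sample of collected speeds with scipy and keeps the better fit. The synthetic data, the candidate distributions, and the log-likelihood comparison are assumptions for illustration, not the exact procedure used to produce the figures below.

import numpy as np
from scipy import stats

# Synthetic stand-in for speeds (km/h) collected every 2 seconds on one road.
rng = np.random.default_rng(0)
speeds = stats.weibull_min.rvs(c=4.0, scale=100.0, size=786, random_state=rng)

candidates = {"weibull": stats.weibull_min, "gamma": stats.gamma}
fits = {}
for name, dist in candidates.items():
    params = dist.fit(speeds, floc=0)                 # fix the location at 0 km/h
    loglik = np.sum(dist.logpdf(speeds, *params))
    fits[name] = (params, loglik)

best = max(fits, key=lambda k: fits[k][1])
print("best fit:", best, "parameters:", fits[best][0])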

We collect traffic data (the vehicle location and the speed) on four types of roads: the expressway, the major road, the collector road, and the local road. These road types have different properties that can affect driver behavior. We then propose a statistical model for each road type based on the curve-fitting method.

A. The Model of Expressway
The expressway sample that we used to collect data in Bangkok is the Ngamwongwan-Chiang Rak road. We collect the speed data at a sampling interval of 2 seconds. The location samples of the vehicle running on the Ngamwongwan road are shown as yellow dots in Figure 9.

Figure 9 The location of vehicles on the Expressway road

In total, we collect about 786 samples of speed data. The result of curve fitting the collected speed data on the expressway is shown in Figure 10. The statistical model that fits the distribution of the speed data is the Weibull function, where most vehicle speeds are high due to the nature of the expressway, which has no traffic lights and wide lanes. The peak speed from this collection is around 110 km/hr.

Figure 10 The PDF function of the speed on the Expressway.

B. The Model of Major Road
For the major road data collection, we chose Vibhavadi road, which is a straight and long road with two or three traffic lanes in each direction. Although the road has no traffic lights, a temporary stop may occur due to the density of cars or the movement of trucks. We collect a speed sample every 2 seconds, giving 1227 values in total. Figure 11 shows the result of the curve-fitting method on the collected data. It shows that the speed distribution on the major road is also a Weibull distribution; however, the parameters of the function are different. For example, the peak speed of the major road is around 65 km/hr, which is much less than that of the expressway.

Figure 11 The PDF function of the speed on the major road.

C. The Model of Collector Road
In this collection, we chose Charansanitwong road as our example of the collector road. The road has two lanes in each direction. It has a few traffic lights with short stops. The samples are collected every 2 seconds, giving 511 values in total. Figure 12 shows the result of the curve-fitting method on the collected data. It shows that the result of the curve fitting is the Gamma distribution function. Due to the traffic lights, vehicles may stop and slow down.


As a result, most of the speed samples fall in the lower range, which is why the Gamma distribution provides the best fit.

Figure 12 The PDF function of the speed on the collector road.

D. The Model of Local Road
In Bangkok, many local roads are used to connect major and collector roads. Vehicles on local roads tend to be stopped at traffic lights for longer periods due to their low priority in traffic light management. In addition, local roads have many traffic lights and many junctions without traffic lights, as well as cars parked alongside the road, all of which tend to slow down the traffic. For our data collection, we chose Prachacheun road, since the road is straight but has many traffic lights. It also crosses several major roads, such as Tivanon Road and Vibhavadi Road. We collect a data sample every 2 seconds, giving 382 values in total. The result of the curve-fitting model of the data collected on the local road is shown in Figure 13. The distribution of the collected data can be represented as a Gamma distribution, similar to that of the collector road. This is due to the nature of a road with traffic lights, where vehicles may be stopped or slowed down. However, the average vehicle speed of the local road from the distribution is much lower than that of the collector road.

Figure 13 The PDF function of the speed on the local road.

From the results, it is shown that the distribution function of the vehicle speed on different types of roads can be different. The speed distribution function is not uniformly random, as is assumed in several mobility models for network simulation. The distribution instead tends to be either a Weibull or a Gamma distribution function: the speed distribution of a road with no traffic lights is probably a Weibull distribution, while that of a road with traffic lights is probably a Gamma distribution. The average speed of each road type tends to be affected by the number of lanes and other properties, such as the narrowing of the available lanes by cars parked along the road. In addition, the average speed of a road can be unique to that road. Further studies are required to find the identity, or fingerprint, of such a road.

V. CONCLUSION
In this paper, we proposed a statistical-based car following model concept that integrates the uniqueness of each road type into the speed calculation of the car following model. The uniqueness of a road can arise from the road structure, which is very specific to each area of the road network, and the behavior of drivers on each road type can be different. We found that road types such as the expressway and the major road tend to have a Weibull speed distribution, but with different average speeds, whereas the collector road and the local road tend to have a Gamma speed distribution, again with different average speeds. It may be concluded that the speed on a road without traffic lights may be modeled as a Weibull distribution, and that on a road with traffic lights as a Gamma distribution. However, the average speed of the distribution may vary based on the nature of each road. Further research is necessary to investigate the traffic data in more detail.

REFERENCES
[1] NCTU-ns simulation software, available at http://nsl.csie.nctu.edu.tw/nctuns.html, last accessed date: 30/1/2012.
[2] The ns-2 simulation software, available at http://www.isi.edu/nsnam/ns/, last accessed date: 30/1/2012.
[3] The ONE simulation software, available at http://www.netlab.tkk.fi/tutkimus/dtn/theone/, last accessed date: 30/1/2012.
[4] F. Bai, N. Sadagopan, A. Helmy, "Important: a framework to systematically analyze the impact of mobility on performance of routing protocols for ad hoc networks," in Proc. 22nd IEEE Annual Joint Conference on Computer Communications and Networking (INFOCOM'03), 2003, pp. 825-835.
[5] V. Davies, "Evaluating mobility models within an ad hoc network," Colorado School of Mines, Colorado, USA, Tech. Rep., 2000.
[6] F. K. Karnadi, Z. H. Mo, K. Chan Lan, "Rapid generation of realistic mobility models for VANET," in Proc. IEEE Wireless Communications and Networking Conference, 2007, pp. 2506-2511.
[7] MOVE mobility model, available at http://lens.csie.ncku.edu.tw/Joomla_version/index.php/researchprojects/past/18-rapid-vanet, last accessed date: 30/1/2012.
[8] SUMO - Simulation of Urban Mobility, available at http://sumo.sourceforge.net/, last accessed date: 30/1/2012.
[9] J. Harri, F. Filali, C. Bonnet, and M. Fiore, "VanetMobiSim: generating realistic mobility patterns for VANETs," in Proc. 3rd International Workshop on Vehicular Ad Hoc Networks (VANET '06), ACM Press, 2006, pp. 96-97.

23

The Eighth International Conference on Computing and Information Technology

IC2IT 2012

Rainfall Prediction in the Northeast Region of Thailand using Cooperative Neuro-Fuzzy Technique
Jesada Kajornrit 1, Kok Wai Wong2, Chun Che Fung3
School of Information Technology, Murdoch University, South Street, Murdoch, Western Australia, 6150
Email: j_kajornrit@hotmail.com1, k.wong@murdoch.edu.au2, l.fung@murdoch.edu.au3

Abstract—Accurate rainfall forecasting is a crucial task for reservoir operation and flood prevention because it can provide an extension of lead-time for flow forecasting. This study proposes two rainfall time series prediction models, the Single Fuzzy Inference System and the Modular Fuzzy Inference System, which use the concept of the cooperative neuro-fuzzy technique. The case study is located in the northeast region of Thailand, and the proposed models are evaluated on four monthly rainfall time series. The experimental results showed that the proposed models can be a good alternative method, providing both accurate results and a human-understandable prediction mechanism. Furthermore, this study found that when the number of training data was small, the proposed models provided better prediction accuracy than artificial neural networks.

Keywords—Rainfall Prediction; Seasonal Time Series; Artificial Neural Networks; Fuzzy Inference System; Average-Based Interval.

I. INTRODUCTION
Rainfall forecasting is indispensable for water management because it can provide an extension of lead-time for the flow forecasting used in water strategic planning. This is especially important in reservoir operation and flood prevention. Usually, rainfall time series prediction has used conventional statistical models and Artificial Neural Networks (ANN) [8]. However, such models are difficult for human analysts to interpret because the prediction mechanism is in parametric form. From a hydrologist's point of view, the accuracy of the prediction and an understanding of the prediction mechanism are equally important. A Fuzzy Inference System (FIS) uses a process of mapping from a given set of input variables to outputs based on a set of human-understandable fuzzy rules [19]. In the last decades, FIS has been successfully applied to various problems [3], [4]. An advantage of FIS is that its decision mechanism is interpretable. As fuzzy rules are closer to human reasoning, an analyst can understand how the model performs the prediction and, if necessary, make use of his or her knowledge to modify the prediction model [5]. However, the disadvantage of FIS is its lack of ability to learn from the given data. In contrast, an ANN is capable of adapting itself from training data. In many cases where the human understanding of the physical process is not clear, ANN has been used to learn the relationship between the observed data [6]. However, the disadvantage of ANN is its black-box nature, which is difficult to interpret. In order to combine the advantages of both models, this paper proposes two rainfall time series prediction models, the Single Fuzzy Inference System (S-FIS) and the Modular Fuzzy Inference System (M-FIS), which use the concept of the cooperative neuro-fuzzy technique. This paper is organized as follows: Section 2 discusses the related works and Section 3 describes the case study area. Input identification and the proposed models are presented in Sections 4 and 5, respectively. Section 6 shows the experimental results. Finally, Section 7 provides the conclusion of this paper.

II. SOFT COMPUTING TECHNIQUES IN HYDROLOGICAL TIME SERIES PREDICTION

In the hydrological discipline, rainfall is more difficult to predict than other climate variables such as temperature. This is due to the highly stochastic nature of rainfall and its high degree of spatial and temporal variability. To address this challenge, ANN has been adopted over the past decades. For example, Coulibaly and Evora [7] compared six different ANNs for predicting daily rainfall data. Among the different types of ANN, they suggested that the Multilayer Perceptron, the Time-lagged Feedforward Network, and the Counterpropagation Fuzzy-Neural Network provided higher accuracy than the Generalized Radial Basis Function Network, the Recurrent Neural Network, and the Time Delay Recurrent Neural Network. Another work was by Wu et al. [8]. They proposed the use of data-driven models with data preprocessing techniques to predict precipitation data at daily and monthly scales. They proposed three preprocessing techniques, namely Moving Average, Principal Component Analysis, and Singular Spectrum Analysis, to smoothen the time series data. Somvanshi et al. [1] confirmed in their work that ANN provided better accuracy than an ARIMA model for daily rainfall time series prediction. Time series prediction is used not only for rainfall data but also for streamflow and rainfall-runoff modeling. Wang et al. [9] compared several computational models, namely Auto-Regressive Moving Average (ARMA), ANN, Adaptive Neuro-Fuzzy Inference System (ANFIS), Genetic Programming (GP), and Support Vector Machine (SVM), to predict monthly discharge time series. Their results indicated that ANFIS, GP, and SVM provided the best performance. Lohani [10] compared ANN, FIS, and a linear transfer model for daily rainfall-runoff modeling under different input domains. The results showed that FIS outperformed the linear model and ANN. Nayak et al. [11] and Kermani et al. [12] proposed the use of the ANFIS model for river flow time series. In addition, Jain and Kumar [13] applied conventional preprocessing approaches (de-trending and de-seasonalizing) to ANN for streamflow time series data.

24

The Eighth International Conference on Computing and Information Technology

IC2IT 2012


Figure 1. The case study area is located in the northeast region of Thailand. The positions of four rainfall stations are illustrated by star marks.

Up to this point, among all the works mentioned, FIS itself has not been used as widely as ANN for time series prediction. Especially for rainfall time series prediction, reports on applications of FIS are limited. Thus, the primary aim of this study is to investigate an appropriate way to use FIS for the rainfall time series prediction problem.

III. CASE STUDY AREA AND DATA
The case study described in this paper is located in the northeast region of Thailand (Fig 1). The four selected rainfall time series are depicted in Fig 2, and Table 1 shows the statistics of the datasets used. The data from 1981 to 1998 were used to calibrate the models, and the data from 1999 to 2001 were used to validate the developed models. This study used the models to predict one step ahead, that is, one month. To validate the models, the Mean Absolute Error (MAE) is adopted as given in equation (1), where x_t is the observed rainfall, \hat{x}_t is the predicted rainfall, and n is the number of validation samples. The Coefficient of Fit (R) is also used to confirm the results. The performance of the proposed models is compared with conventional Box-Jenkins (BJ) models: Autoregressive (AR), Autoregressive Integrated Moving Average (ARIMA), and Seasonal Autoregressive Integrated Moving Average (SARIMA) [1], [8], [10], [13], [15].

\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \lvert x_t - \hat{x}_t \rvert    (1)
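For reference, a minimal sketch of how the two validation measures could be computed; the variable names are assumptions, and R is taken here as the Pearson correlation between the observed and predicted series.

import numpy as np

def mae(observed, predicted):
    """Mean Absolute Error, as in equation (1)."""
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return np.mean(np.abs(observed - predicted))

def coefficient_of_fit(observed, predicted):
    """Coefficient of fit R, computed here as the Pearson correlation."""
    return np.corrcoef(observed, predicted)[0, 1]

obs = [120.0, 340.5, 0.0, 88.2]
pred = [100.0, 310.0, 15.0, 95.0]
print(mae(obs, pred), coefficient_of_fit(obs, pred))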

TABLE I. DATASETS STATISTICS

Statistics   TS356010   TS381010   TS388002   TS407005
Mean          1303.34     889.04    1286.28    1319.70
SD            1382.98     922.99    1425.88    1346.80
Kurtosis        -0.10      0.808      0.532     -0.224
Skewness         0.95      1.080      1.131      0.825
Minimum             0          0          0          0
Maximum          5099       4704       6117       5519
Latitude       17.15N     16.66N     16.65N     15.50N
Longitude     104.13E    102.88E    104.05E    104.75E
Altitude          176        164        155        129

Figure 2. The four selected monthly rainfall time series used in this study (stations TS356010, TS381010, TS388002, and TS407005).


IV. INPUT IDENTIFICATION
In general, the inputs of a time series model are based on previous data points (lags). For BJ models, analysis of the autocorrelation function (ACF) and the partial autocorrelation function (PACF) is used as a guide to identify the appropriate inputs. However, in the case of ANN or other non-linear models, there is no theory to support the use of these functions [14]. Although some studies address the applicability of ACF and PACF to non-linear models [15], others prefer to conduct experiments to identify the appropriate inputs [11]. This study conducted an experiment to find appropriate inputs based on data from five rainfall stations. Data from 1981 to 1995 were used for calibration and data from 1996 to 1998 were used for validation. By increasing the number of lags fed to the ANNs, six different input models were prepared and tested: to predict x(t), the first input model is x(t-1), the second input model is x(t-1), x(t-2), and so on. Fig 3 shows the results of the experiment; the average normalized MAEs from the five time series are illustrated by the bold line. The results show that the MAE is lowest at lag 5, so the five-previous-lags model would be expected to be an appropriate input. Since increasing the number of input lags does not significantly improve the prediction performance, additional methods may be needed. In the case of seasonal data, there are other methods to identify appropriate inputs that improve the prediction accuracy, for example using Phase Space Reconstruction (PSR) [16] or adding a time coefficient as a supplementary feature [2]. However, the first method needs a large number of training data. According to the curse of dimensionality, when the number of input dimensions increases, the number of training data must increase as well [17]. In this case study, the number of records is limited to 15 years, which can be considered relatively small; therefore it is more appropriate to add the time coefficient. The time coefficient (Ct) is used to help the model scope the prediction to a specific period. It may be Ct = 2 (wet and dry periods), Ct = 4 (winter, spring, summer, and fall), or Ct = 12 (calendar months). This study adopted Ct = 12 as the supplementary feature. In Fig 3, Ct is added to the original input data and tested with ANNs (light line). The results show that using Ct with 2 previous lags provides the lowest average MAE and improves the prediction performance by up to 26% (dashed line). Therefore, the appropriate inputs used in this study are the rainfall at lag 1 and lag 2 and Ct. This experimental result is related to the work of Raman and Sunilkumar [18], who studied monthly inflow time series. In the hydrological process, inflow is directly affected by rainfall; consequently, the characteristics of the flow graph and the rainfall graph are rather similar. They suggested feeding data from 2 previous lags to ANN models; however, instead of using a single ANN, they created twelve ANN models, one for each specific month, and used the month to select the associated model to feed the data into. If one considers this model as a black box, one can see that their input is the inflow from 2 previous lags and Ct, which is relatively similar to this study.
Figure 3. Average MAE measure of ANN models among different inputs.
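A minimal sketch of how the lagged inputs and the Ct = 12 time coefficient could be assembled from a monthly rainfall series; the array layout and column order are assumptions for illustration.

import numpy as np

def build_inputs(rainfall, n_lags=2):
    """Build rows of [Ct, x(t-1), ..., x(t-n_lags)] -> target x(t).

    rainfall: 1-D array of monthly values, assumed to start in January.
    Ct is encoded as the calendar month 1..12 of the target x(t)."""
    rainfall = np.asarray(rainfall, dtype=float)
    X, y = [], []
    for t in range(n_lags, len(rainfall)):
        ct = (t % 12) + 1                      # month of the target value
        lags = rainfall[t - n_lags:t][::-1]    # x(t-1), x(t-2), ...
        X.append(np.concatenate(([ct], lags)))
        y.append(rainfall[t])
    return np.array(X), np.array(y)

series = np.random.default_rng(0).gamma(shape=2.0, scale=100.0, size=216)  # 18 years of toy data
X, y = build_inputs(series)
print(X.shape, y.shape)   # (214, 3) (214,)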

V. THE PROPOSED MODELS
This paper adopts the Mamdani fuzzy inference system [20], since such a model is more intuitive than the Sugeno approach [21]. To reduce the computational cost, triangular membership functions (MFs) are used. This study proposes two FIS models, namely the Single Fuzzy Inference System (S-FIS) and the Modular Fuzzy Inference System (M-FIS), which use the concept of the cooperative neuro-fuzzy technique. In the S-FIS model, there is one single FIS model; rainfall data from lag 1 and lag 2 and Ct are fed directly into the model. In the M-FIS model, there are twelve FIS models, one associated with each calendar month; Ct is used to select the associated model into which the rainfall data from lag 1 and lag 2 are fed. The architectural overview of these two models is shown in Fig 4, and Fig 5 shows the general steps used to create these FIS models. The first step is to calculate the appropriate interval length between two consecutive MFs and then generate the Mamdani FIS rule base model; at this step, the Average-Based Interval is adopted. The second step is to create the fuzzy rules. In this study, a Back-Propagation Neural Network (BPNN) is used to generalize from the training data and then used to extract the fuzzy rules.

Figure 4. The architectural overview of the S-FIS (top) and M-FIS (bottom) models, with inputs Ct, Xt-1, and Xt-2 and output Xt.


Fig 6 (b) shows the rainfall MFs of the S-FIS for station TS356010. One can see that there are two interval lengths. The point at which the interval length changes is around the 50th percentile of all the data: the data are separated into a lower area and an upper area using the 50th percentile as the boundary, and average-based intervals are calculated for both areas. Since the beginning and ending rainfall periods have smaller fluctuations than the middle period, using a smaller interval length there is more appropriate [2]. In the M-FIS model, using two interval lengths is not necessary, since each sub-model is created for a specific month. As mentioned before, the drawback of FIS is the lack of ability to learn from data; such a model needs experts or another supplementary procedure to help create the fuzzy rules. In this study, the proposed methodology uses a BPNN to learn the generalization features from the training data [5] and then uses it to extract the fuzzy rules. Once the BPNN has been used to extract the fuzzy rules, it is not used anymore. The steps to create the fuzzy rules are as follows. Step 1: Train the BPNN with the training data; at this step, the BPNN learns and generalizes from the training data. Step 2: Prepare the set of input data; the set of input data, in this case, consists of all the points in the input space where the degree of membership of the FIS's input is 1 in every dimension, and these input data form the premise part of the fuzzy rules. Step 3: Feed the input data into the BPNN; the outputs of the BPNN are mapped to the nearest MF of the FIS's output, and these output data form the consequence part of the fuzzy rules. For example, considering the MFs in Figure 6, the input-output pair [3, 500, 750:1700] is replaced with the fuzzy rule "IF Ct=Mar and Lag1=A3 and Lag2=A4 THEN Predicted=A6". This step uses a BPNN with one hidden layer; the number of hidden nodes and input nodes is 3 for S-FIS and 2 for M-FIS.

VI. EXPERIMENTAL RESULTS
The experimental results are shown in Table 2 and Table 3. In the tables, S-ANN and M-ANN are the neural networks used to create the fuzzy rules for S-FIS and M-FIS, respectively. In fact, the S-ANN and M-ANN themselves are also prediction models. The performance of S-ANN and S-FIS is quite similar; it can be noted that the conversion from an ANN-based model to an FIS-based model does not reduce the prediction performance of the ANN. However, this conversion improves the S-ANN model from a qualitative point of view, since S-FIS is interpretable through a set of human-understandable fuzzy rules. The interesting point is the performance of M-ANN versus M-FIS: this conversion can improve the performance of M-ANN. Next, the proposed models were compared with three conventional BJ models. The comparison results are depicted in Fig 7. Since the results from the MAE and R measures agree with each other, these experimental results are rather consistent. Similar to the work by Raman and Sunilkumar [18], the AR model uses degree 2 because it uses the same input as the proposed models. The ARIMA and SARIMA models used in the study were automatically generated and optimized by statistical software; however, these generated models were also rechecked to ensure that they provided the best accuracy.

Figure 5. General steps to create the S-FIS and M-FIS models: calculate the average-based interval length from the training data, generate the FIS rule base and its MFs, train the BPNN, and generate the fuzzy rules to obtain the FIS model.

In the S-FIS model, the MFs of Ct are simply depicted in Fig 6 (a). For the rainfall inputs, the interval length between two consecutive MFs is very important to define. When the interval length is too large, it may not be able to represent the fluctuation in the time series; on the other hand, when it is too small, the objective of the FIS is diminished. Huarng [22] proposed the Average-Based Interval to define the appropriate interval length of MFs for fuzzy time series data, based on the concept that at least half of the fluctuations in the time series should be reflected by the effective interval length. The fluctuation in time series data is the absolute value of the first difference of any two consecutive data points; in this method, half of the average value of all fluctuations in the time series is defined as the interval length between two consecutive MFs. This method was successfully applied in the work reported in [23]. In this paper, the method is adapted slightly to fit the nature of the rainfall time series in this application.
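A small sketch of the average-based interval calculation described above (half of the mean absolute first difference), shown here without the paper's additional 50th-percentile split; the function name and the rounding-free output are assumptions.

import numpy as np

def average_based_interval(series):
    """Interval length between consecutive MFs: half of the mean absolute
    first difference of the time series (Huarng's average-based interval)."""
    series = np.asarray(series, dtype=float)
    fluctuation = np.abs(np.diff(series))
    return 0.5 * fluctuation.mean()

rain = np.array([12.0, 85.0, 240.0, 610.0, 330.0, 90.0, 5.0])
print(average_based_interval(rain))   # half of the average month-to-month change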

Figure 6. An example of the membership functions in TS356010's S-FIS model: Ct (a), with twelve MFs for the calendar months, and Rainfall (b), with MFs A1 to A13 over the range 0 to 5000.
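The following sketch illustrates the three-step rule-extraction procedure described above: train a small network on (Ct, lag1, lag2) inputs, evaluate it at the points where every input MF peaks, and map each network output to the nearest output MF to form a rule. The scikit-learn regressor, the triangular-MF centres, and the labeling scheme are assumptions for illustration, not the exact configuration used in the paper.

import itertools
import numpy as np
from sklearn.neural_network import MLPRegressor

# Assumed MF centres (points where the membership degree is 1).
ct_centres = np.arange(1, 13)                       # months 1..12
rain_centres = np.linspace(0, 5000, 13)             # A1..A13
labels = [f"A{i+1}" for i in range(len(rain_centres))]

# Step 1: train the BPNN on (Ct, lag1, lag2) -> rainfall (toy data here).
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.integers(1, 13, 500),
                           rng.gamma(2.0, 400.0, 500),
                           rng.gamma(2.0, 400.0, 500)])
y_train = 0.5 * X_train[:, 1] + 0.3 * X_train[:, 2] + 50 * X_train[:, 0]
net = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=0).fit(X_train, y_train)

# Steps 2-3: feed every MF-peak combination through the network and
# map its output to the nearest output MF to form the rule consequent.
rules = []
for ct, l1, l2 in itertools.product(ct_centres, rain_centres, rain_centres):
    pred = net.predict([[ct, l1, l2]])[0]
    consequent = labels[int(np.argmin(np.abs(rain_centres - pred)))]
    rules.append((int(ct), l1, l2, consequent))

print(len(rules), "rules; e.g.", rules[0])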


TABLE II. MAE MEASURE OF VALIDATION PERIOD

Datasets    S-ANN    S-FIS    M-ANN    M-FIS      AR     ARIMA   SARIMA
TS356010   450.99   447.56   560.44   496.35   747.37   747.01   538.99
TS381010   332.71   343.88   439.91   442.32   534.32   402.42   503.99
TS388002   736.70   725.39   811.99   639.29   912.64   856.88   714.74
TS407005   636.37   634.65   776.63   661.30   901.76   672.35   799.34

TABLE III. R MEASURE OF VALIDATION PERIOD

Datasets    S-ANN    S-FIS    M-ANN    M-FIS      AR     ARIMA   SARIMA
TS356010    0.884    0.887    0.755    0.850    0.650    0.759    0.837
TS381010    0.719    0.709    0.606    0.668    0.464    0.733    0.575
TS388002    0.760    0.773    0.712    0.871    0.606    0.685    0.769
TS407005    0.768    0.770    0.633    0.736    0.594    0.755    0.681

In terms of MAE, among the three BJ models, the AR model provided the lowest accuracy on all datasets, and ARIMA showed higher accuracy than SARIMA on two of the datasets. For stations TS356010 and TS407005, the proposed models, especially the S-FIS model, showed higher performance than all BJ models. For station TS381010, the ARIMA model was better than M-FIS but worse than S-FIS. For station TS388002, the SARIMA model showed better performance than S-FIS but worse than M-FIS. The average normalized MAE and the average R measure over all datasets are shown in Fig 8. It can be seen from the figure that, overall, the proposed models performed better than the AR, ARIMA, and SARIMA models. All the aforementioned results validate the experiments from a quantitative point of view. From a qualitative point of view, the proposed models are easier to interpret than the other models because their decision mechanism is in the form of fuzzy rules, which is close to human reasoning [5]. Furthermore, when the models are in rule-base form, it is easier for a human expert to enhance and optimize them further. The advantage of the S-FIS model is that the time coefficient is expressed in terms of MFs, so it is possible to apply an optimization method to this feature; however, a large number of fuzzy rules is needed for the single model. On the other hand, the M-FIS model has a smaller number of fuzzy rules compared to S-FIS, but it does not use any time feature.

VII. CONCLUSION
Accurate rainfall forecasting is crucial for reservoir operation and flood prevention because it can provide an extension of lead-time for flow forecasting, and many time series prediction models have been applied to it. However, the prediction mechanisms of those models may be difficult for human analysts to interpret. This study proposed the Single Fuzzy Inference System and the Modular Fuzzy Inference System, which use the concept of the cooperative neuro-fuzzy technique to predict monthly rainfall time series in the northeast region of Thailand. The reported models used the average-based interval method

to determine the fuzzy intervals and a BPNN to extract the fuzzy rules. The prediction performance of the proposed models was compared with conventional Box-Jenkins models. The experimental results showed that the proposed models can be a good alternative. Furthermore, the prediction mechanism can be interpreted through human-understandable fuzzy rules.

Figure 7. The comparison of performance between the proposed models and the conventional Box-Jenkins models: MAE (a) and R (b).


Figure 8. The average normalized MAE (a) and average R (b) of all datasets.

REFERENCES
[1] V. K. Somvanshi, et al., "Modeling and prediction of rainfall using artificial neural network and ARIMA techniques," J. Ind. Geophys. Union, vol. 10, no. 2, pp. 141-151, 2006.
[2] Z. F. Toprak, et al., "Modeling monthly mean flow in a poorly gauged basin by fuzzy logic," Clean, vol. 37, no. 7, pp. 555-567, 2009.
[3] S. Kato and K. W. Wong, "Intelligent Automated Guided Vehicle with Reverse Strategy: A Comparison Study," in M. Köppen, N. K. Kasabov, G. G. Coghill (Eds.), Advances in Neuro-Information Processing, Lecture Notes in Computer Science, Springer-Verlag, Berlin Heidelberg, pp. 638-646, 2009.
[4] K. W. Wong and T. D. Gedeon, "Petrophysical Properties Prediction Using Self-generating Fuzzy Rules Inference System with Modified Alpha-cut Based Fuzzy Interpolation," Proceedings of the Seventh International Conference on Neural Information Processing (ICONIP), pp. 1088-1092, November 2000, Korea.
[5] K. W. Wong, P. M. Wong, T. D. Gedeon, and C. C. Fung, "Rainfall Prediction Model Using Soft Computing Technique," Soft Computing, vol. 7, issue 6, pp. 434-438, 2003.
[6] C. C. Fung, K. W. Wong, H. Eren, R. Charlebois, and H. Crocker, "Modular Artificial Neural Network for Prediction of Petrophysical Properties from Well Log Data," IEEE Transactions on Instrumentation & Measurement, 46(6), December, pp. 1259-1263, 1997.

[7] P. Coulibaly and N. D. Evora, "Comparison of neural network methods for infilling missing daily weather records," Journal of Hydrology, vol. 341, pp. 27-41, 2007.
[8] C. L. Wu, K. W. Chau, and C. Fan, "Prediction of rainfall time series using modular artificial neural networks coupled with data-preprocessing techniques," Journal of Hydrology, vol. 389, pp. 146-167, 2010.
[9] W. Wang, K. Chau, C. Cheng, and L. Qiu, "A comparison of performance of several artificial intelligence methods for forecasting monthly discharge time series," Journal of Hydrology, vol. 374, pp. 294-306, 2009.
[10] A. K. Lohani, N. K. Goel, and K. K. S. Bhatia, "Comparative study of neural network, fuzzy logic and linear transfer function techniques in daily rainfall-runoff modeling under different input domains," Hydrological Processes, vol. 25, pp. 175-193, 2011.
[11] P. C. Nayak, et al., "A neuro-fuzzy computing technique for modeling hydrological time series," Journal of Hydrology, vol. 291, pp. 52-66, 2004.
[12] M. Z. Kermani and M. Teshnehlab, "Using adaptive neuro-fuzzy inference system for hydrological time series prediction," Applied Soft Computing, vol. 8, pp. 928-936, 2008.
[13] A. Jain and A. M. Kumar, "Hybrid neural network models for hydrologic time series forecasting," Applied Soft Computing, vol. 7, pp. 585-592, 2007.
[14] M. Khashei, M. Bijari, and G. A. R. Ardali, "Improvement of Auto-Regressive Integrated Moving Average models using Fuzzy logic and Artificial Neural Networks (ANNs)," Neurocomputing, vol. 72, pp. 956-967, 2009.
[15] K. P. Sudheer, "A data-driven algorithm for constructing artificial neural network rainfall-runoff models," Hydrological Processes, vol. 16, pp. 1325-1330, 2002.
[16] C. L. Wu and K. W. Chau, "Data-driven models for monthly streamflow time series prediction," Engineering Applications of Artificial Intelligence, vol. 23, pp. 1350-1367, 2010.
[17] S. Marsland, Machine Learning: An Algorithmic Perspective, CRC Press, 2009.
[18] H. Raman and N. Sunilkumar, "Multivariate modeling of water resources time series using artificial neural network," Hydrological Sciences Journal - des Sciences Hydrologiques, vol. 40, pp. 145-163, 1995.
[19] L. A. Zadeh, "Fuzzy Sets," Information and Control, vol. 8, pp. 338-353, 1965.
[20] E. H. Mamdani and S. Assilian, "An experiment in linguistic synthesis with a fuzzy logic controller," International Journal of Man-Machine Studies, vol. 7, no. 1, pp. 1-13, 1975.
[21] M. Sugeno, Industrial Applications of Fuzzy Control, North-Holland, Amsterdam, 1985.
[22] K. Huarng, "Effective lengths of intervals to improve forecasting in fuzzy time series," Fuzzy Sets and Systems, vol. 123, pp. 387-394, 2001.
[23] H. Liu and M. Wei, "An improved fuzzy forecasting method for seasonal time series," Expert Systems with Applications, vol. 37, pp. 6310-6318, 2010.


Interval-valued Intuitionistic Fuzzy ELECTRE Method


Ming-Che Wu
Graduate Institute of Business and Management College of Management, Chang Gung University Taoyuan 333, Taiwan richwu50@gmail.com

Ting-Yu Chen
Department of Industrial and Business Management College of Management, Chang Gung University Taoyuan 333, Taiwan tychen@mail.cgu.edu.tw

Abstract—In this study, the proposed method replaces crisp evaluation values with vague values, i.e. interval-valued intuitionistic fuzzy (IVIF) data, and develops an IVIF Elimination and Choice Translating Reality (ELECTRE) method for solving multiple criteria decision making problems. The analyst can use the characteristics of IVIF sets to classify different kinds of concordance (discordance) sets using the score and accuracy functions, the membership uncertainty degree, and the hesitation uncertainty index, and then apply the proposed method to select the better alternatives. Keywords—interval-valued intuitionistic fuzzy; ELECTRE; multiple criteria decision making; score function; accuracy function

I. INTRODUCTION

The Elimination and Choice Translating Reality (ELECTRE) method is one of the outranking relation methods; it was first introduced by Roy [3]. The threshold values in the classical ELECTRE method play an important role in filtering alternatives, and different threshold values produce different filtering results. The evaluation data in the classical ELECTRE method are almost always exact values, which can affect the threshold values. Moreover, in real world cases, exact values can be difficult to determine precisely, since analysts' judgments are often vague; for these reasons, some studies [4,5,8] have developed ELECTRE methods with type-2 fuzzy data. Vahdani and Hadipour [4] presented a fuzzy ELECTRE method using the concept of the interval-valued fuzzy set (IVFS) with unequal criteria weights, in which the criteria values are considered as triangular interval-valued fuzzy numbers, which are also used to distinguish the concordance and discordance sets, in order to solve multi-criteria decision-making (MCDM) problems. Vahdani et al. [5] proposed an ELECTRE method that uses the concepts of interval weights and interval data to distinguish the concordance and discordance sets and then evaluates a set of alternatives, and applied it to the problem of supplier selection. Wu and Chen [8] proposed an intuitionistic fuzzy (IF) ELECTRE method that uses the concept of the score and accuracy functions, i.e. it calculates the different combinations of the membership function, the non-membership function, and the hesitancy degree to distinguish different kinds of concordance and discordance

sets, and then uses the result to rank all alternatives and solve MCDM problems. The intuitionistic fuzzy set (IFS) was first introduced by Atanassov [1], and the IFS generalizes the fuzzy set introduced by Zadeh [11]. The interval-valued intuitionistic fuzzy set (IVIFS), introduced by Atanassov and Gargov [2], combines the IFS concept with the interval-valued fuzzy set concept; each IVIFS is characterized by a membership function and a non-membership function whose values are intervals rather than exact numbers, which makes it a very useful means of describing the decision information in the decision making process. As the literature review shows, few studies have applied the ELECTRE method with IVIFS to real life cases. The main purpose of this paper is to further extend the ELECTRE method and develop a new method to solve MCDM problems in interval-valued intuitionistic fuzzy (IVIF) environments. The major difference between the current study and other available papers is the proposed method, whose logic is simple but which is suitable for the vagueness of real life situations. The proposed method also uses the score and accuracy functions and adds two more factors, the membership and hesitation uncertainty indices, i.e. it applies the factors of the membership function, the non-membership function, and the hesitancy degree to distinguish different kinds of concordance and discordance sets, and then finally selects the best alternative. The remainder of this paper is organized as follows. Section 2 introduces the decision environment with IVIF data, the score and accuracy functions and some indices, and the construction of the IVIF decision matrix. Section 3 introduces the IVIF ELECTRE method and its algorithm. Section 4 illustrates the proposed method with a numerical example. Section 5 presents the discussion.

II. DECISION ENVIRONMENT WITH IVIF DATA

A. Interval-valued intuitionistic fuzzy sets
Based on the definition of IVIFS in the study of Atanassov and Gargov [2], we have the following. Definition 1: Let X be a non-empty set of the universe, and let D[0,1] be the set of all closed subintervals of [0,1]. An IVIFS \tilde{A} in X is an expression defined by

This research is supported by the National Science Council (No. NSC 99-2410-H-182-022-MY3).


\tilde{A} = \{ \langle x, \tilde{M}_{\tilde{A}}(x), \tilde{N}_{\tilde{A}}(x) \rangle \mid x \in X \} = \{ \langle x, [M^L_{\tilde{A}}(x), M^U_{\tilde{A}}(x)], [N^L_{\tilde{A}}(x), N^U_{\tilde{A}}(x)] \rangle \mid x \in X \},   (1)

where \tilde{M}_{\tilde{A}}(x): X \to D[0,1] and \tilde{N}_{\tilde{A}}(x): X \to D[0,1] denote the membership degree and the non-membership degree for any x \in X, respectively. \tilde{M}_{\tilde{A}}(x) and \tilde{N}_{\tilde{A}}(x) are closed intervals rather than real numbers; their lower and upper boundaries are denoted by M^L_{\tilde{A}}(x), M^U_{\tilde{A}}(x), N^L_{\tilde{A}}(x) and N^U_{\tilde{A}}(x), respectively, and 0 \le M^U_{\tilde{A}}(x) + N^U_{\tilde{A}}(x) \le 1.

B. The score, accuracy functions and some indices
The studies on score and accuracy functions for handling multi-criteria fuzzy decision-making problems are reviewed as follows. In Definition 1, an IVIFS \tilde{A} in X is defined as \tilde{A} = \{ \langle x, [M^L_{\tilde{A}}(x), M^U_{\tilde{A}}(x)], [N^L_{\tilde{A}}(x), N^U_{\tilde{A}}(x)] \rangle \mid x \in X \}; for convenience, we call \tilde{A}_n = ([M^L_{\tilde{A}_n}(x), M^U_{\tilde{A}_n}(x)], [N^L_{\tilde{A}_n}(x), N^U_{\tilde{A}_n}(x)]) an interval-valued intuitionistic fuzzy number (IVIFN) [10], where [M^L_{\tilde{A}_n}(x), M^U_{\tilde{A}_n}(x)] \subset [0,1], [N^L_{\tilde{A}_n}(x), N^U_{\tilde{A}_n}(x)] \subset [0,1], and M^U_{\tilde{A}_n}(x) + N^U_{\tilde{A}_n}(x) \le 1.

Definition 2: [2] For each element x, the hesitancy degree of an intuitionistic fuzzy interval of x \in X in \tilde{A} is defined as follows:

\pi_{\tilde{A}}(x) = 1 - \tilde{M}_{\tilde{A}}(x) - \tilde{N}_{\tilde{A}}(x) = [\,1 - M^U_{\tilde{A}}(x) - N^U_{\tilde{A}}(x),\; 1 - M^L_{\tilde{A}}(x) - N^L_{\tilde{A}}(x)\,] = [\pi^L_{\tilde{A}}(x), \pi^U_{\tilde{A}}(x)].   (2)

Definition 3: The operations of IVIFS [2,9] are defined as follows, for A, B \in \mathrm{IVIFS}(X):
(a) A \subseteq B iff M^L_A(x) \le M^L_B(x), M^U_A(x) \le M^U_B(x), N^L_A(x) \ge N^L_B(x), and N^U_A(x) \ge N^U_B(x);
(b) A = B iff A \subseteq B and B \subseteq A;
(c) d_1(A,B) = \frac{1}{4} \sum_{j=1}^{n} [\,|M^L_A(x_j) - M^L_B(x_j)| + |M^U_A(x_j) - M^U_B(x_j)| + |N^L_A(x_j) - N^L_B(x_j)| + |N^U_A(x_j) - N^U_B(x_j)|\,];
(d) d_2(A,B) = \frac{1}{4n} \sum_{j=1}^{n} [\,|M^L_A(x_j) - M^L_B(x_j)| + |M^U_A(x_j) - M^U_B(x_j)| + |N^L_A(x_j) - N^L_B(x_j)| + |N^U_A(x_j) - N^U_B(x_j)|\,];
(e) d_3(A,B) = \frac{1}{4} \sum_{j=1}^{n} w_j [\,|M^L_A(x_j) - M^L_B(x_j)| + |M^U_A(x_j) - M^U_B(x_j)| + |N^L_A(x_j) - N^L_B(x_j)| + |N^U_A(x_j) - N^U_B(x_j)|\,],   (3)

where w_j = \{w_1, w_2, \ldots, w_n\} is the weight vector of the elements x_j (j = 1, 2, \ldots, n). Here d_1(A,B), d_2(A,B) and d_3(A,B) are the Hamming distance, the normalized Hamming distance, and the weighted Hamming distance, respectively.

Xu [10] defined a score function s to measure the degree of suitability of an IVIFN \tilde{A}_n as follows:

s(\tilde{A}_n) = \frac{1}{2} \big( M^L_{\tilde{A}_n}(x) - N^L_{\tilde{A}_n}(x) + M^U_{\tilde{A}_n}(x) - N^U_{\tilde{A}_n}(x) \big), where s(\tilde{A}_n) \in [-1,1].

The larger the value of s(\tilde{A}_n), the higher the degree of the IVIFN \tilde{A}_n. Wei and Wang [7] defined an accuracy function h to evaluate the accuracy degree of an \tilde{A}_n as follows:

h(\tilde{A}_n) = \frac{1}{2} \big( M^L_{\tilde{A}_n}(x) + M^U_{\tilde{A}_n}(x) + N^L_{\tilde{A}_n}(x) + N^U_{\tilde{A}_n}(x) \big), where h(\tilde{A}_n) \in [0,1].

The larger the value of h(\tilde{A}_n), the higher the degree of the IVIFN \tilde{A}_n. The membership uncertainty index T was proposed [6] to evaluate the membership uncertainty degree of an IVIFN \tilde{A}_n as follows:

T(\tilde{A}_n) = M^U_{\tilde{A}_n}(x) + N^L_{\tilde{A}_n}(x) - M^L_{\tilde{A}_n}(x) - N^U_{\tilde{A}_n}(x), where -1 \le T(\tilde{A}_n) \le 1.

The larger the value of T(\tilde{A}_n), the smaller the IVIFN \tilde{A}_n. The hesitation uncertainty index G of an \tilde{A}_n is defined as follows:

G(\tilde{A}_n) = M^U_{\tilde{A}_n}(x) + N^U_{\tilde{A}_n}(x) - M^L_{\tilde{A}_n}(x) - N^L_{\tilde{A}_n}(x),

and the larger the value of G(\tilde{A}_n), the smaller the IVIFN \tilde{A}_n.

In this study, we classify the different types of concordance and discordance sets in the proposed method using the concepts of the score and accuracy functions, the membership uncertainty index, and the hesitation uncertainty index.
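As a small illustration of the four measures above, the sketch below computes s, h, T, and G for an IVIFN represented as ([M_L, M_U], [N_L, N_U]); the tuple representation and the function names are assumptions.

def score(ivifn):
    """s = ((M_L - N_L) + (M_U - N_U)) / 2, in [-1, 1]."""
    (ml, mu), (nl, nu) = ivifn
    return 0.5 * (ml - nl + mu - nu)

def accuracy(ivifn):
    """h = (M_L + M_U + N_L + N_U) / 2, in [0, 1]."""
    (ml, mu), (nl, nu) = ivifn
    return 0.5 * (ml + mu + nl + nu)

def membership_uncertainty(ivifn):
    """T = M_U + N_L - M_L - N_U."""
    (ml, mu), (nl, nu) = ivifn
    return mu + nl - ml - nu

def hesitation_uncertainty(ivifn):
    """G = M_U + N_U - M_L - N_L."""
    (ml, mu), (nl, nu) = ivifn
    return mu + nu - ml - nl

a = ((0.4, 0.6), (0.2, 0.3))   # example IVIFN
print(score(a), accuracy(a), membership_uncertainty(a), hesitation_uncertainty(a))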

C. Construction of the IVIF decision matrix
We extend the canonical matrix format to an IVIF decision matrix \tilde{M}. An IVIFS \tilde{A}_i of the i-th alternative on X is given by

\tilde{A}_i = \{ \langle x, \tilde{X}_{ij} \rangle \mid x \in X \}, where \tilde{X}_{ij} = ([M^L_{\tilde{A}}(x), M^U_{\tilde{A}}(x)], [N^L_{\tilde{A}}(x), N^U_{\tilde{A}}(x)]).

The \tilde{X}_{ij} indicate the degrees of the membership and non-membership intervals of the i-th alternative with respect to the j-th criterion. The IVIF decision matrix \tilde{M} can be expressed as follows:


\tilde{M} = \begin{pmatrix} \tilde{X}_{11} & \cdots & \tilde{X}_{1n} \\ \vdots & \ddots & \vdots \\ \tilde{X}_{m1} & \cdots & \tilde{X}_{mn} \end{pmatrix} = \begin{pmatrix} ([M^L_{11}, M^U_{11}],[N^L_{11}, N^U_{11}]) & \cdots & ([M^L_{1n}, M^U_{1n}],[N^L_{1n}, N^U_{1n}]) \\ \vdots & \ddots & \vdots \\ ([M^L_{m1}, M^U_{m1}],[N^L_{m1}, N^U_{m1}]) & \cdots & ([M^L_{mn}, M^U_{mn}],[N^L_{mn}, N^U_{mn}]) \end{pmatrix}.   (4)

An IVIFS W, a set of grades of importance, in X is defined as follows:

W = \{ \langle x_j, w_j(x_j) \rangle \mid x_j \in X \},   (5)

where 0 \le w_j(x_j) \le 1, \sum_{j=1}^{n} w_j(x_j) = 1, and w_j(x_j) is the degree of importance assigned to each criterion.

III. ELECTRE METHOD WITH IVIF DATA
The proposed method utilizes the concept of the score and accuracy functions to distinguish the concordance set and the discordance set from the evaluation information with IVIFS data, then constructs the concordance, discordance, concordance dominance, discordance dominance, and aggregate dominance matrices, respectively, and finally selects the best alternative from the aggregate dominance matrix. In this section, the IVIF ELECTRE method and its algorithm are introduced and used throughout this paper.

A. The IVIF ELECTRE method
The concordance and discordance sets with IVIF data and their definitions are as follows.

Definition 4: The concordance set C_{kl} is defined as

C1_{kl} = \{ j \mid M^L_{kj} - N^L_{kj} + M^U_{kj} - N^U_{kj} > M^L_{lj} - N^L_{lj} + M^U_{lj} - N^U_{lj} \},   (6)

C2_{kl} = \{ j \mid M^L_{kj} + M^U_{kj} + N^L_{kj} + N^U_{kj} > M^L_{lj} + M^U_{lj} + N^L_{lj} + N^U_{lj} \} when s(\tilde{X}_{kj}) = s(\tilde{X}_{lj}),   (7)

C3_{kl} = \{ j \mid M^U_{kj} + N^L_{kj} - M^L_{kj} - N^U_{kj} < M^U_{lj} + N^L_{lj} - M^L_{lj} - N^U_{lj} \} when h(\tilde{X}_{kj}) = h(\tilde{X}_{lj}),   (8)

C4_{kl} = \{ j \mid M^U_{kj} + N^U_{kj} - M^L_{kj} - N^L_{kj} \le M^U_{lj} + N^U_{lj} - M^L_{lj} - N^L_{lj} \} when T(\tilde{X}_{kj}) = T(\tilde{X}_{lj}),   (9)

where C_{kl} = \{C1_{kl}, C2_{kl}, C3_{kl}, C4_{kl}\}, J = \{ j \mid j = 1, 2, \ldots, n \}, and \tilde{X}_{kj}, \tilde{X}_{lj} stand for the evaluations of alternatives k and l in criterion j, respectively. The s(\tilde{X}_{kj}), h(\tilde{X}_{kj}) and T(\tilde{X}_{kj}) are the score function, accuracy function and membership uncertainty index, respectively, which are defined in Section II.B.

Definition 5: The discordance set D_{kl} is defined as

D1_{kl} = \{ j \mid M^L_{kj} - N^L_{kj} + M^U_{kj} - N^U_{kj} < M^L_{lj} - N^L_{lj} + M^U_{lj} - N^U_{lj} \},   (10)

D2_{kl} = \{ j \mid M^L_{kj} + M^U_{kj} + N^L_{kj} + N^U_{kj} < M^L_{lj} + M^U_{lj} + N^L_{lj} + N^U_{lj} \} when s(\tilde{X}_{kj}) = s(\tilde{X}_{lj}),   (11)

D3_{kl} = \{ j \mid M^U_{kj} + N^L_{kj} - M^L_{kj} - N^U_{kj} > M^U_{lj} + N^L_{lj} - M^L_{lj} - N^U_{lj} \} when h(\tilde{X}_{kj}) = h(\tilde{X}_{lj}),   (12)

D4_{kl} = \{ j \mid M^U_{kj} + N^U_{kj} - M^L_{kj} - N^L_{kj} > M^U_{lj} + N^U_{lj} - M^L_{lj} - N^L_{lj} \} when T(\tilde{X}_{kj}) = T(\tilde{X}_{lj}),   (13)

where D_{kl} = \{D1_{kl}, D2_{kl}, D3_{kl}, D4_{kl}\}.

The relative value of the concordance set of the IVIF ELECTRE method is measured through the concordance index. The concordance index g_{kl} between A_k and A_l is defined as

g_{kl} = \sum_{j \in C_{kl}} w_C \, w_j(x_j),   (14)

where w_C is the weight of the concordance set and w_j(x_j) is defined in (5). The concordance matrix G is defined as follows:

G = \begin{pmatrix} - & g_{12} & \cdots & g_{1m} \\ g_{21} & - & \cdots & g_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ g_{m1} & g_{m2} & \cdots & - \end{pmatrix},   (15)

where the maximum value of g_{kl} is denoted by g^*; a smaller g_{kl} indicates that the evaluation of a certain A_k is worse than the evaluation of the competing A_l. The discordance index is defined as follows:


h_{kl} = \frac{\max_{j \in D_{kl}} w_D \, d(\tilde{X}_{kj}, \tilde{X}_{lj})}{\max_{j \in J} d(\tilde{X}_{kj}, \tilde{X}_{lj})},   (16)

where d(\tilde{X}_{kj}, \tilde{X}_{lj}) is defined in (3) and w_D is the weight of the discordance set in the IVIF ELECTRE method. The discordance matrix H is defined as follows:

H = \begin{pmatrix} - & h_{12} & \cdots & h_{1m} \\ h_{21} & - & \cdots & h_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ h_{m1} & h_{m2} & \cdots & - \end{pmatrix},   (17)

where the maximum value of h_{kl} is denoted by h^*, which is more discordant than the other cases. The concordance dominance matrix K is defined as follows:

K = \begin{pmatrix} - & k_{12} & \cdots & k_{1m} \\ k_{21} & - & \cdots & k_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ k_{m1} & k_{m2} & \cdots & - \end{pmatrix},   (18)

where k_{kl} = g^* - g_{kl}, and a higher value of k_{kl} indicates that A_k is less favorable than A_l. The discordance dominance matrix L is defined as follows:

L = \begin{pmatrix} - & l_{12} & \cdots & l_{1m} \\ l_{21} & - & \cdots & l_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ l_{m1} & l_{m2} & \cdots & - \end{pmatrix},   (19)

where l_{kl} = h^* - h_{kl}, and a higher value of l_{kl} indicates that A_k is preferred over A_l. The aggregate dominance matrix R is defined as follows:

R = \begin{pmatrix} - & r_{12} & \cdots & r_{1m} \\ r_{21} & - & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & - \end{pmatrix},   (20)

where

r_{kl} = \frac{l_{kl}}{k_{kl} + l_{kl}},   (21)

k_{kl} and l_{kl} are defined in (18) and (19), and r_{kl} is in the range from 0 to 1. A higher value of r_{kl} indicates that the alternative A_k is more concordant than the alternative A_l; thus, it is a better alternative. In the best alternative selection process,

T_k = \frac{1}{m-1} \sum_{l=1, l \ne k}^{m} r_{kl}, \quad k = 1, 2, \ldots, m,   (22)

and T_k is the final value of the evaluation. All alternatives can be ranked according to the value of T_k. The best alternative A^* with T_{k^*} can be generated and defined as follows:

T_{k^*}(A^*) = \max\{ T_k \},   (23)

where T_{k^*} is the final value of the best alternative and A^* is the best alternative.

B. Algorithm
The algorithm and decision process of the IVIF ELECTRE method can be summarized in the following four steps; Step 3 calculates the concordance and discordance matrices, constructs the concordance dominance and discordance dominance matrices, and determines the aggregate dominance matrix. Figure 1 illustrates a conceptual model of the proposed method.

1. Construct the decision matrix, using (4), (5).
2. Identify the concordance and discordance sets, using (6)-(13).
3. Calculate the matrices, using (14)-(21).
4. Choose the best alternative, using (22), (23).

Figure 1. The process of the IVIF ELECTRE method algorithm.
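A compact sketch of these four steps under the definitions above; the data layout, the helper names, and the toy data are assumptions for illustration, not the authors' implementation.

import numpy as np

# An IVIFN is represented as ([M_L, M_U], [N_L, N_U]).
def s_fn(x):                       # score
    (ml, mu), (nl, nu) = x
    return 0.5 * (ml - nl + mu - nu)

def h_fn(x):                       # accuracy
    (ml, mu), (nl, nu) = x
    return 0.5 * (ml + mu + nl + nu)

def t_fn(x):                       # membership uncertainty index
    (ml, mu), (nl, nu) = x
    return mu + nl - ml - nu

def g_fn(x):                       # hesitation uncertainty index
    (ml, mu), (nl, nu) = x
    return mu + nu - ml - nl

def d_fn(x, y):                    # Hamming distance of one criterion, as in (3)
    (aml, amu), (anl, anu) = x
    (bml, bmu), (bnl, bnu) = y
    return 0.25 * (abs(aml - bml) + abs(amu - bmu) + abs(anl - bnl) + abs(anu - bnu))

def concordant(xk, xl):
    """True if criterion value xk of A_k outranks xl of A_l (C1-C4 cascade)."""
    if s_fn(xk) != s_fn(xl):
        return s_fn(xk) > s_fn(xl)
    if h_fn(xk) != h_fn(xl):
        return h_fn(xk) > h_fn(xl)
    if t_fn(xk) != t_fn(xl):
        return t_fn(xk) < t_fn(xl)
    return g_fn(xk) <= g_fn(xl)

def ivif_electre(matrix, weights, w_c=1.0, w_d=1.0):
    m, n = len(matrix), len(weights)
    g = np.zeros((m, m))
    h = np.zeros((m, m))
    for k in range(m):
        for l in range(m):
            if k == l:
                continue
            C = [j for j in range(n) if concordant(matrix[k][j], matrix[l][j])]
            D = [j for j in range(n) if j not in C]
            g[k, l] = sum(w_c * weights[j] for j in C)                      # eq. (14)
            dists = [d_fn(matrix[k][j], matrix[l][j]) for j in range(n)]
            h[k, l] = (max(w_d * dists[j] for j in D) / max(dists)) if D else 0.0   # eq. (16)
    K = g.max() - g                                   # concordance dominance (18)
    L = h.max() - h                                   # discordance dominance (19)
    denom = K + L
    R = np.divide(L, denom, out=np.zeros_like(L), where=denom > 0)          # eqs. (20)-(21)
    np.fill_diagonal(R, 0.0)
    return R.sum(axis=1) / (m - 1)                    # final values T_k, eq. (22)

# Toy data: 2 alternatives x 2 criteria (values are illustrative only).
M = [[((0.4, 0.6), (0.2, 0.3)), ((0.1, 0.3), (0.5, 0.6))],
     [((0.6, 0.7), (0.2, 0.3)), ((0.4, 0.7), (0.1, 0.2))]]
print(ivif_electre(M, weights=[0.6, 0.4]))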

IV. NUMERICAL EXAMPLE
In this section, we present an example that is connected to a decision-making problem of best alternative selection. Suppose a potential banker intends to invest money in one of four possible alternatives (companies), named A1, A2, A3, and A4. The criteria of a company are x1 (risk analysis), x2 (the growth analysis), and x3 (the environmental impact analysis) in the selection problem. The subjective importance levels of the different criteria W are given by the decision makers:


W = [w_1, w_2, w_3] = [0.35, 0.25, 0.4]. The decision makers also give the relative weights as follows: W' = [w_C, w_D] = [1, 1]. The IVIF decision matrix \tilde{M} is given with cardinal information:

\tilde{M} = \begin{pmatrix}
([0.4, 0.5],[0.3, 0.4]) & ([0.4, 0.6],[0.2, 0.4]) & ([0.1, 0.3],[0.5, 0.6]) \\
([0.4, 0.6],[0.2, 0.3]) & ([0.6, 0.7],[0.2, 0.3]) & ([0.4, 0.7],[0.1, 0.2]) \\
([0.3, 0.6],[0.3, 0.4]) & ([0.5, 0.6],[0.3, 0.4]) & ([0.5, 0.6],[0.1, 0.3]) \\
([0.7, 0.8],[0.1, 0.2]) & ([0.6, 0.7],[0.1, 0.3]) & ([0.3, 0.4],[0.1, 0.2])
\end{pmatrix}

(Step 1 has been completed.) Applying Step 2, the concordance and discordance sets are identified using the result of Step 1. The concordance set, applying (6)-(9), is:

C_{kl} = \begin{pmatrix} - & 1,3 & 1,3 & 1,3 \\ 1,2,3 & - & 1,2,3 & 2,3 \\ 2,3 & 1,2,3 & - & 2,3 \\ 1,2,3 & 1,2,3 & 1,2,3 & - \end{pmatrix}

For example, C_{24}, which is in the 2nd row and the 4th column of the concordance set, is {2,3}. The discordance set, obtained by applying (10)-(13), is as follows:

D_{kl} = \begin{pmatrix} - & 2 & 2 & 2 \\ \varnothing & - & \varnothing & 1 \\ 1 & \varnothing & - & 1 \\ \varnothing & \varnothing & \varnothing & - \end{pmatrix}

Applying Step 3, the concordance matrix is calculated:

G = \begin{pmatrix} - & 0.8 & 0.8 & 0.8 \\ 1 & - & 1 & 0.5 \\ 0.5 & 1 & - & 0.5 \\ 1 & 1 & 1 & - \end{pmatrix}

For example, g_{21} = w_C w_1 + w_C w_2 + w_C w_3 = 1 \times 0.35 + 1 \times 0.25 + 1 \times 0.40 = 1.0.

The discordance matrix is calculated:

H = \begin{pmatrix} - & 0.267 & 0.143 & 0.357 \\ 0 & - & 0 & 1 \\ 0.143 & 0 & - & 1 \\ 0 & 0 & 0 & - \end{pmatrix}

For example,

h_{12} = \frac{\max_{j \in D_{12}} w_D \, d(\tilde{X}_{1j}, \tilde{X}_{2j})}{\max_{j \in J} d(\tilde{X}_{1j}, \tilde{X}_{2j})} = \frac{0.100}{0.375} = 0.267,

where d(\tilde{X}_{13}, \tilde{X}_{23}) = \frac{1}{4}(|0.1-0.4| + |0.3-0.7| + |0.5-0.1| + |0.6-0.2|) = 0.375, and w_D \, d(\tilde{X}_{12}, \tilde{X}_{22}) = 1 \times \frac{1}{4}(|0.4-0.6| + |0.6-0.7| + |0.2-0.2| + |0.4-0.3|) = 0.100.

The concordance dominance matrix is constructed as follows:

K = \begin{pmatrix} - & 0.2 & 0.2 & 0.2 \\ 0 & - & 0 & 0.5 \\ 0.5 & 0 & - & 0.5 \\ 0 & 0 & 0 & - \end{pmatrix}

The discordance dominance matrix is constructed as follows:

L = \begin{pmatrix} - & 0.733 & 0.857 & 0.643 \\ 1 & - & 1 & 0 \\ 0.857 & 1 & - & 0 \\ 1 & 1 & 1 & - \end{pmatrix}

The aggregate dominance matrix is determined:

R = \begin{pmatrix} - & 0.786 & 0.811 & 0.763 \\ 1 & - & 1 & 0 \\ 0.632 & 1 & - & 0 \\ 1 & 1 & 1 & - \end{pmatrix}

Applying Step 4, the best alternative is chosen: T_1 = 0.786, T_2 = 0.667, T_3 = 0.544, T_4 = 1.000. The optimal ranking order of the alternatives is given by A_4 \succ A_1 \succ A_2 \succ A_3. The best alternative is A_4.

V. DISCUSSION
In this study, we provide a new method, the IVIF ELECTRE method, for solving MCDM problems with IVIF information. A decision maker can use the proposed method to gain valuable information from the evaluation data provided by users, who do not usually provide preference data. Decision makers utilize IVIF data instead of single values in the evaluation process of the ELECTRE method and use those data to classify different kinds of concordance and discordance sets


to fit a real decision environment. This new approach integrates the concept of the outranking relationship of the ELECTRE method. In the proposed method, we can classify different types of concordance and discordance sets using the concepts of the score function, the accuracy function, the membership uncertainty degree, and the hesitation uncertainty index, and use the concordance and discordance sets to construct the concordance and discordance matrices. Furthermore, decision makers can choose the best alternative using the concepts of positive and negative ideal points. We used the proposed method to rank all alternatives and determine the best alternative. This paper is a first step in using the IVIF ELECTRE method to solve MCDM problems. In a future study, we will apply the proposed method to predict consumer decision making using a questionnaire in an empirical study of the service provider selection problem.

REFERENCES
[1] K. T. Atanassov, "Intuitionistic fuzzy sets," Fuzzy Sets and Systems, vol. 20, pp. 87-96, 1986.
[2] K. Atanassov and G. Gargov, "Interval valued intuitionistic fuzzy sets," Fuzzy Sets and Systems, vol. 31, pp. 343-349, 1989.
[3] B. Roy, "Classement et choix en présence de points de vue multiples (la méthode ELECTRE)," RIRO, vol. 8, pp. 57-75, 1968.

[4] B. Vahdani and H. Hadipour, "Extension of the ELECTRE method based on interval-valued fuzzy sets," Soft Computing, vol. 15, pp. 569-579, 2011.
[5] B. Vahdani, A. H. K. Jabbari, V. Roshanaei, and M. Zandieh, "Extension of the ELECTRE method for decision-making problems with interval weights and data," International Journal of Advanced Manufacturing Technology, vol. 50, pp. 793-800, 2010.
[6] Z. Wang, K. W. Li, and W. Wang, "An approach to multiattribute decision making with interval-valued intuitionistic fuzzy assessments and incomplete weights," Information Sciences, vol. 179, pp. 3026-3040, 2009.
[7] G. W. Wei and X. R. Wang, "Some geometric aggregation operators on interval-valued intuitionistic fuzzy sets and their application to group decision making," International Conference on Computational Intelligence and Security, pp. 495-499, December 2007.
[8] M.-C. Wu and T.-Y. Chen, "The ELECTRE multicriteria analysis approach based on Atanassov's intuitionistic fuzzy sets," Expert Systems with Applications, vol. 38, pp. 12318-12327, 2011.
[9] Z. S. Xu, "On similarity measures of interval-valued intuitionistic fuzzy sets and their application to pattern recognitions," Journal of Southeast University, vol. 23, pp. 139-143, 2007a.
[10] Z. S. Xu, "Methods for aggregating interval-valued intuitionistic fuzzy information and their application to decision making," Control and Decision, vol. 22, pp. 215-219, 2007b.
[11] L. A. Zadeh, "Fuzzy Sets," Information and Control, vol. 8, pp. 338-353, 1965.


Optimizing of Interval Type-2 Fuzzy Logic Systems Using Hybrid Heuristic Algorithm Evaluated by Classification
Adisak Sangsongfa and Phayung Meesad
Department of Information Technology, Faculty of Information Technology
King Mongkut's University of Technology North Bangkok, Bangkok, Thailand
Email: adisak sang@hotmail.com, pym@kmutnb.ac.th

Abstract—In this research, an optimization of the rule base and the parameters of interval type-2 fuzzy set generation by a hybrid heuristic algorithm using particle swarm and genetic algorithms is proposed for a classification application. For the Iris data set, 90 records were selected randomly for training, and the remaining 60 records were used for testing. For the Wisconsin Breast Cancer data set, the authors deleted the 16 records with missing attribute values and randomly selected 500 records for training, with the remaining 183 records used for testing. The proposed method was able to minimize the rule base, minimize the number of linguistic variables, and produce an accurate classification of 95% on the first dataset and 98.71% on the second dataset. Keywords—Interval Type-2 Fuzzy Logic Systems; GA; PSO;

I. INTRODUCTION
In 1965, Lotfi A. Zadeh, a professor of computer science at the University of California, Berkeley, developed the fuzzy logic system, which has been widely used in many areas such as decision making, classification, control, prediction, and optimization. However, the original fuzzy logic system, called the type-1 fuzzy set, sometimes cannot solve certain problems, especially problems that are very large, complex, and/or uncertain. Therefore, in 1975 Zadeh developed and formulated the type-2 fuzzy set to meet the needs of data sets which are complex and uncertain. Thus, the type-2 fuzzy set has been used widely and continuously in many cases [1]. Recently, there has been growing interest in the interval type-2 fuzzy set, which is a special case of the type-2 fuzzy set, because Mendel and John [2] reformulated all set operations in both the vertical-slice and wavy-slice manner. They concluded that general type-2 fuzzy set operations are too complex to understand and implement, but operations using the interval type-2 fuzzy set involve only simple interval arithmetic, which means computation costs are reduced. The interval type-2 fuzzy system consists of four parts: fuzzification, fuzzy rule base, inference engine, and defuzzification. Moreover, the fuzzy rule base and the interval type-2 fuzzy sets are complicated when determining the exact membership functions and a complete fuzzy rule base. So, the

optimization of the interval type-2 fuzzy set and the fuzzy rule base must be used to estimate the values in place of an expert system. Many researchers have proposed and introduced the optimization of the interval type-2 fuzzy set and fuzzy rule base; for example, Zhao [3] proposed an adaptive interval type-2 fuzzy set using gradient descent algorithms to optimize the inference engine and fuzzy rule base, and Hidalgo [4] proposed the optimization of interval type-2 fuzzy sets applied to a modular neural network using a genetic algorithm. Moreover, many researchers apply the interval type-2 fuzzy logic system to uncertain datasets, and the creation of an optimized interval type-2 fuzzy logic system will give the most accurate outputs. There are also many optimization techniques which have been proposed for building interval type-2 fuzzy systems. Some traditional optimization techniques are based on mathematics and some are based on heuristic algorithms. Some optimization techniques, such as heuristic optimization, are often difficult and time consuming. Sometimes, improving the heuristic algorithms provides good performance, as with hybrid heuristic algorithms [5]. Moreover, the hybrid heuristic is a much younger candidate algorithm compared to the genetic algorithm and particle swarm optimization in the domain of meta-heuristic-based optimization. In this paper, a new algorithm, called the hybrid heuristic algorithm, which combines a genetic algorithm with particle swarm optimization, is proposed, together with an optimization of the interval type-2 fuzzy set and fuzzy rule base using the proposed hybrid heuristic algorithm. The algorithm is used to optimize a model by minimizing the number of fuzzy rules, minimizing the number of linguistic variables, and maximizing the accuracy of the output. The framework and the corresponding algorithms are then tested and evaluated to prove the concept by applying them to the Iris dataset [6] and the Wisconsin Breast Cancer dataset [7] as examples of classification.


II. RELATED WORK
A. Particle Swarm Optimization (PSO)
The PSO initializes a swarm of particles at random, with each particle deciding its new velocity and position based on its own past optimal position Pi and the past optimal position of the swarm Pg. Let xi = (xi1, xi2, ..., xin) represent the current position of particle i, vi = (vi1, vi2, ..., vin) its current velocity, and Pi = (pi1, pi2, ..., pin) its past optimal position; then the particle uses the following equations to adjust its velocity and position:

V_{i,(t+1)} = w V_{i,(t)} + c1 r1 (P_i - x_{i,(t)}) + c2 r2 (P_g - x_{i,(t)})   (1)

x_{i,(t+1)} = x_{i,(t)} + V_{i,(t+1)}   (2)

C. Interval Type-2 Fuzzy Set
Interval type-2 fuzzy sets are particularly useful when it is difficult to determine the exact membership function, or when modeling the diverse opinions of different individuals. The membership functions, by which an interval type-2 fuzzy inference system approximates expert knowledge and judgment under uncertain conditions, can be constructed from surveys or by using optimization algorithms. Its basic framework consists of four parts: fuzzification, fuzzy rule base, fuzzy inference engine, and defuzzification, as shown in Fig. 1.

where c1 and c2 are acceleration constants in the range [0, 2], r1 and r2 are random numbers in [0, 1], and w is the inertia weight, which is used to maintain the momentum of the particle. The first term on the right-hand side of (1) is the particle's velocity at time t. The second term represents self-learning by the particle based on its own history. The last term reflects social learning through information sharing among individual particles in the swarm. All three parts contribute to the particle's ability to search the analyzed space, which simulates the swarm behavior mathematically [8].
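To make equations (1) and (2) concrete, the following is a minimal sketch of one PSO velocity and position update in Python (the paper reports a Matlab implementation); the parameter values w, c1, and c2 here are illustrative assumptions, not the settings used by the authors.

```python
import numpy as np

def pso_step(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One PSO update per equations (1) and (2).

    x, v      : (n_particles, n_dims) current positions and velocities
    p_best    : (n_particles, n_dims) each particle's best known position (Pi)
    g_best    : (n_dims,)             swarm's best known position (Pg)
    w, c1, c2 : inertia weight and acceleration constants (illustrative values)
    """
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(x.shape)      # random numbers in [0, 1]
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # eq. (1)
    x_new = x + v_new                                                # eq. (2)
    return x_new, v_new
```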

Fig. 1. Interval Type-2 Fuzzy System

B. Genetic Algorithm (GA)
A GA generally has four components: 1) a population of individuals, where each individual represents a possible solution; 2) a fitness function, an evaluation function by which we can tell whether an individual is a good solution or not; 3) a selection function, which decides how to pick good individuals from the current population for creating the next generation; and 4) genetic operators such as crossover and mutation, which explore new regions of the search space while keeping some of the current information at the same time. GAs are based on genetics, especially on Darwin's theory of survival of the fittest: the weaker members of a species tend to die away, leaving the stronger and fitter, and the surviving members create offspring and ensure the continuing survival of the species. This concept, together with the concept of natural selection, is used in information technology to enhance the performance of computers [9].
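As a minimal illustration of the four GA components listed above, the sketch below evolves a population of bit-string individuals with tournament selection, one-point crossover, and bit-flip mutation. The fitness function, population size, and rates are placeholders for illustration only, not the encoding or settings used in this paper.

```python
import random

def evolve(fitness, n_bits=20, pop_size=30, generations=50,
           crossover_rate=0.8, mutation_rate=0.02, seed=1):
    """Tiny GA: population, fitness function, selection, crossover and mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def tournament(k=3):                          # selection function
        return max(rng.sample(pop, k), key=fitness)

    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            a, b = tournament(), tournament()
            if rng.random() < crossover_rate:     # one-point crossover
                cut = rng.randrange(1, n_bits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                child = [bit ^ 1 if rng.random() < mutation_rate else bit
                         for bit in child]        # bit-flip mutation
                new_pop.append(child)
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)

# Example: maximise the number of ones in the bit string.
best = evolve(fitness=sum)
```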

We can describe the interval type-2 fuzzy logic system as follows: the crisp inputs are first fuzzified into input interval type-2 fuzzy sets. The fuzzifier creates the membership functions, which consist of the types of membership function, the linguistic variables, and the fuzzy rule base. There are many types of membership function, such as the triangular, trapezoidal, Gaussian, smooth, and Z-membership functions. The fuzzifier then sends the interval type-2 fuzzy sets to the inference engine and the rule base to produce output type-2 fuzzy sets. The rules of the interval type-2 fuzzy logic system remain the same as in the type-1 fuzzy logic system, but the antecedents and/or consequents are represented by interval type-2 fuzzy sets. A finite number of fuzzy rules, which can be represented in if-then form, are integrated into the fuzzy rule base. A standard fuzzy rule base is shown below.

R^1: If x_1 is A_1^1 and x_2 is A_2^1, ..., x_n is A_n^1 Then y is B^1.
R^2: If x_1 is A_1^2 and x_2 is A_2^2, ..., x_n is A_n^2 Then y is B^2.
...
R^M: If x_1 is A_1^M and x_2 is A_2^M, ..., x_n is A_n^M Then y is B^M.

where x_1, ..., x_n are state variables and y is the control variable. The linguistic values A_1^j, ..., A_n^j and B^j (j = 1, 2, ..., M) are respectively defined in the universes U_1, ..., U_n and V. In fuzzification, each crisp input variable x_i is mapped into an interval type-2


fuzzy set A_xi, i = 1, 2, ..., n. The inference engine combines all the fired rules and gives a non-linear mapping from the input interval type-2 fuzzy sets to the output interval type-2 fuzzy sets. The multiple antecedents in each rule are connected by the Meet operation, the membership grades in the input sets are combined with those in the output sets by the extended sup-star composition, and multiple rules are combined by the Join operation. The type-2 fuzzy outputs of the inference engine are then processed by the type-reducer, which combines the output sets and performs a centroid calculation that leads to type-1 fuzzy sets called the type-reduced sets. After the type-reduction process, the type-reduced sets are defuzzified (by taking the average of the type-reduced set) to obtain crisp outputs [3]. In the interval type-2 fuzzy logic system design, we assumed a Z-membership function for the first membership function, a triangular membership function for the secondary membership function, and a smooth membership function for the last membership function, with center-of-sets type reduction and defuzzification using the centroid of the type-reduced set.

III. THE PROPOSED FRAMEWORK

In our framework, we present a new hybrid heuristic algorithm which is developed to optimize the interval type-2 fuzzy logic system using the Iris and breast cancer datasets. The new algorithm to optimize the interval type-2 fuzzy sets and fuzzy rule base uses a hybrid heuristic search which is a sequential combination of GA and PSO. The proposed algorithm is used to optimize the number of linguistic variables, the parameters of the membership functions, and the rule base, subject to the constraints of a minimum number of linguistic variables, a minimum rule base, and maximum accuracy. The framework is shown in Fig. 2. From the framework, we can describe the steps of the proposed method for optimizing the interval type-2 fuzzy sets and fuzzy rule base using hybrid heuristic search. The framework is given in four steps described below.
Step 1: Determine the structure of the interval type-2 fuzzy system framework.
Step 2: Determine the fuzzy rule base using clustering.
Step 3: Determine the universes of the input and output variables, their types of membership functions, and the linguistic parameters of the membership functions.
Step 4: Determine and optimize the fuzzy inference engine using the hybrid heuristic algorithm, which is a combination of GA and PSO.
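To make the fuzzification described earlier in this section concrete, the sketch below represents an interval type-2 fuzzy set by a lower and an upper membership function (LMF, UMF) and returns the firing interval of a crisp input. The triangular shape and the parameter values are illustrative assumptions only; they are not the Z-, triangular, and smooth membership functions actually tuned by the paper's algorithm.

```python
def tri(x, a, b, c):
    """Triangular membership value of x for the triangle (a, b, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def it2_firing_interval(x, lower_params, upper_params):
    """Firing interval [mu_lower(x), mu_upper(x)] of an interval type-2 set
    whose footprint of uncertainty is bounded by two triangular MFs."""
    lo = tri(x, *lower_params)
    hi = tri(x, *upper_params)
    return min(lo, hi), max(lo, hi)

# Illustrative set: the UMF is slightly wider than the LMF around the same centre.
print(it2_firing_interval(4.9, lower_params=(4.0, 5.0, 6.0),
                          upper_params=(3.5, 5.0, 6.5)))
```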

Fig. 2. Framework of Optimization of the Interval Type-2 Fuzzy System Using Hybrid Heuristic Algorithms

1) Determine the structure of the interval type-2 fuzzy system framework: In Fig. 2, the framework shows the structure of the optimization of the interval type-2 fuzzy sets and rule base based on hybrid heuristic algorithms. The hybrid heuristic algorithm uses sequential hybridization. The GA is used for the first local optimum of the interval type-2 fuzzy sets, which consists of the interval type-2 membership functions, the interval type-2 linguistic parameters (LMF, UMF), and the rule base. The PSO is then used for the final optimization, which obtains the best don't-care rules.
2) Determine the fuzzy rule base using clustering: We used the K-means clustering algorithm [10] to group the dataset to determine a feasible fuzzy rule base. The standard K-means objective function is as follows.
J = Σ_{j=1}^{k} Σ_{i=1}^{n} || x_i^(j) - c_j ||^2   (3)

where k is the number of clusters, || x_i^(j) - c_j ||^2 is a chosen distance measure between a data point x_i^(j) and the cluster centre c_j, and J is an indicator of the distance of the n data points from their respective cluster centres.
3) Determine the universes of the input and output variables and their types of membership functions: For the universes of the input and output variables and their primary membership functions, the Z-membership, triangular, and smooth membership functions were used, as shown in Fig. 3. Fig. 3 displays the interval type-2 membership functions of the four Iris attributes.
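Referring back to equation (3), the following is a minimal sketch that computes the K-means objective J for a given set of cluster centres by assigning each point to its nearest centre. The random data is a stand-in; only the attribute count (4) and K=7 are taken from the paper's Iris setting.

```python
import numpy as np

def kmeans_objective(points, centres):
    """J = sum of squared distances from each point to its nearest centre (eq. 3)."""
    diffs = points[:, None, :] - centres[None, :, :]   # (n_points, k, n_dims)
    sq_dist = (diffs ** 2).sum(axis=2)                 # ||x_i - c_j||^2
    return sq_dist.min(axis=1).sum()

rng = np.random.default_rng(0)
X = rng.random((150, 4))                               # stand-in for 4-attribute Iris data
C = X[rng.choice(len(X), 7, replace=False)]            # K=7 initial centres, as used for Iris
print(kmeans_objective(X, C))
```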


TABLE I. PREDEFINED MEMBERSHIP FUNCTION FOR FIVE LINGUISTIC VARIABLES.

Linguistic Index | Linguistic Terms
0 | Don't Care
1 | Very Low
2 | Low
3 | Medium
4 | High
5 | Very High
created by cross sections of the linguistic variables from each dimension. The fitness function is then Fit = Acc(chrom_i), where chrom_i is one of the chromosomes [chrom_1, chrom_2, ..., chrom_n]. The accuracy (Acc) is

Acc = (Number of Correct Classifications) / (Total Number of Training Data)
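A minimal sketch of this fitness evaluation is shown below; the function classify(chromosome, features), which stands in for decoding the chromosome into an interval type-2 fuzzy system and classifying one record, is a hypothetical placeholder, not part of the paper's implementation.

```python
def fitness(chromosome, training_data, classify):
    """Fit = Acc(chrom) = correct classifications / total number of training records."""
    correct = sum(1 for features, label in training_data
                  if classify(chromosome, features) == label)
    return correct / len(training_data)
```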

The attributes are graded as attribute1 = 2, attribute2 = 2, attribute3 = 5, and attribute4 = 5. The definition of the linguistic labels and the number of linguistic variables are given in Table 1.

IV. THE EXPERIMENTAL EVALUATION SETUP
To evaluate the proposed Hybrid Heuristic Type-2 (HHType-2) algorithm for building interval type-2 fuzzy systems, two benchmark classification datasets from the UCI machine learning repository were used: Fisher's Iris data and the Wisconsin Breast Cancer data.
A. Datasets
The Iris dataset has 4 variables and 3 classes; 90 records were selected randomly for training, and the remaining 60 records were used for testing. The Wisconsin Breast Cancer dataset has 699 records; the 16 records with missing attribute values were deleted. Each record consists of 9 features plus the class attribute; 500 records were selected randomly for training, and the remaining 183 records were used for testing. Fig. 4 shows the scatter plot of the Iris dataset, and Fig. 5 illustrates the scatter plot of the Iris dataset clustered using the K-means algorithm (K=7). Fig. 6 shows the scatter plot of the Wisconsin Breast Cancer dataset, and Fig. 7 shows the scatter plot of the Wisconsin Breast Cancer dataset clustered using the K-means algorithm (K=4).

Fig. 3. The Example of Interval Type-2 Membership Functions

4) Determine and optimize the fuzzy inference engine using the hybrid heuristic algorithm: First, the fuzzy rule-based system is encoded into a genotype, or chromosome. Each chromosome represents a fuzzy system composed of the number of linguistic variables in each dimension, the membership function parameters of each linguistic variable, and the fuzzy rules, which include the don't-care rules from the PSO. A chromosome (chrom) consists of 4 parts, or genes:
chrom = [IM, IL, R, DcR]   (4)

(IM, IL, and R are optimized by the GA; DcR is optimized by the PSO)

where IL = [IL_1, IL_2, ..., IL_n] is the set of numbers of interval linguistic variables, IM = [im_{1,1}, im_{1,2}, ..., im_{n,IL_n}] is the set of interval membership function parameters of those linguistic variables, R = [R_1, R_2, ..., R_{IL_1 x IL_2 x ... x IL_n}] is the fuzzy rule part, in which each element is an integer index of a linguistic variable in each dimension, and DcR = [Ra_111, Ra_112, ..., Ra_lmk] is the don't-care rule part, in which each element is an integer index indicating the don't-care rule of the corresponding rule. The length of a chromosome can vary depending on the fuzzy partition

Fig. 4. The scatter plot of Iris Dataset (* represents Setosa, represents Versicolor, and represents Verginica)

B. Experimental Results
The experiments were performed on a MacBook Pro with an Intel Core 2 Duo CPU at 2.66 GHz and 4.00 GB of RAM,


TABLE II. OPTIMIZED MEMBERSHIP FUNCTIONS, RULES, AND ACCURACY FOR THE IRIS AND WBC CLASSIFICATION DATA.

Dataset | Membership | Rule | Class | Total Acc
Iris | [2 2 5 5] | 0011; 0133; 2155 | 1; 2; 3 | 95%
WBC | [2 2 3 2 2 3 2 2 2] | 000000000; 101021222; 223221222 | 2; 4; 4 | 98.71%

TABLE III. CONFUSION MATRIX FOR THE IRIS CLASSIFICATION DATA.

Attribute | Setosa | Versicolor | Verginica | Total
Setosa | 20 | 0 | 0 | 20
Versicolor | 0 | 19 | 2 | 21
Verginica | 0 | 1 | 18 | 19
Total Testing | 20 | 20 | 20 | 60

Fig. 5. The scatter plot of Iris Dataset with Clustering (* represents Setosa, represents Versicolor, and represents Verginica)

Fig. 6. The scatter plot of Wisconsin Breast Cancer Dataset (* represents Class 2, and represents Class 4)

and 5 particles. The PSO then completed 20 runs with an execution time of 2387.5543 s. The optimal fuzzy system obtained with the hybrid heuristic algorithm produced the accuracy shown in Tables 2 and 3. An example of a chromosome from the WBC dataset is shown in Fig. 8.

running on Mac OS. All algorithms were implemented in Matlab. The first dataset (Iris) was run 20 times with an average execution time of 662.2635 s. The simulation population was 100 individuals. The fittest individuals were then passed to the PSO to optimize the don't-care rules. In the PSO, each individual was simulated with 50 swarms and 5 particles. The PSO completed 20 runs with an execution time of 429.7597 s. The second dataset (Wisconsin Breast Cancer, WBC) was run 20 times with an average execution time of 3679.2428 s. The simulation population was 100 individuals. The individuals were passed to the PSO to optimize the don't-care rules, and the individuals of the PSO were simulated with 50 swarms

Fig. 8. Chromosome of the Interval Type-2 Fuzzy Logic System for the WBC dataset: membership [2 2 3 2 2 3 2 2 2]; linguistic parameters 1.9782 3.4612 7.8462 9.1217 3.3353 3.3353 6.5211 1.8434 4.2727 1.0098 1.0312 1.6815 8.3999 1.9247 3.5459 1.9992 5.2612 1.0692 1.1521 2.1435 2.1556 3.6942 7.6163, ..., 2.6585 3.1273 7.0503 9.8831 3.9131 3.1549 6.9534; rule base 111111111 -> 1, 222123221 -> 4, 223222221 -> 4, 000000000 -> 2, 101021222 -> 4, 223221222 -> 4.

Fig. 7. The scatter plot of Wisconsin Breast Cancer Dataset with Clustering (* represents Class 2, and represents Class 4)

To demonstrate the performance of the proposed framework, we compared its accuracy with other well-known classifiers applied to the same problem. Table 4 presents the classification accuracy of these algorithms. From Table 4, it can be seen that the accuracy of the proposed hybrid heuristic algorithm is among the best achieved. In the same way, we compared the results obtained with the algorithm on the same problem against other algorithms. Table 5 shows the classification accuracy of these algorithms on the Wisconsin Breast Cancer dataset; the results of the Hybrid Heuristic Type-2 (HHType-2) algorithm were competitive with, or even better than, the other algorithms. Although GA and PSO are not new, when the two come together they make a powerful new algorithm (Hybrid


TABLE IV. COMPARISONS OF THE HHTYPE-2 AND THE OTHER ALGORITHMS, FOR THE IRIS DATA.

Algorithm | Setosa | Versicolor | Verginica | Acc
1. VSM [11] | 100% | 93.33% | 94% | 95.78%
2. NT-growth [11] | 100% | 93.5% | 91.13% | 94.87%
3. Dasarathy [11] | 100% | 98% | 86% | 94.67%
4. C4 [11] | 100% | 91.07% | 90.61% | 93.87%
5. IRSS [12] | 100% | 92% | 96% | 96%
6. PSOCCAS [13] | 100% | 96% | 98% | 98%
7. HHTypeI [5] | 100% | 97% | 98% | 98%
8. HHType II | 100% | 95% | 90% | 95%

Fig. 9. The Bar chart of comparisons of the HHType-2 and the other algorithms, for the Iris data.

TABLE V. COMPARISONS OF THE HHTYPE-2 AND THE OTHER ALGORITHMS, FOR THE WBC DATA.

Algorithm | Accuracy
1. SANFIS [14] | 96.07%
2. FUZZY [15] | 96.71%
3. ILFN [15] | 97.23%
4. ILFN-FUZZY [15] | 98.13%
5. IGANFIS [16] | 98.24%
6. HHType II | 98.71%
Fig. 10. The Bar chart of comparisons of the HHType-2 and the other algorithms, for the WBC data.

Heuristic Type-2) for optimization, which is quite efficient in terms of performance.
V. CONCLUSION
In this paper, a methodology based on a hybrid heuristic algorithm, a combination of PSO and GA approaches, is proposed to build interval type-2 fuzzy sets for classification. The algorithm is used to optimize a model by minimizing the number of fuzzy rules, minimizing the number of linguistic variables, and maximizing the accuracy of the fuzzy rule base. The performance of the proposed hybrid heuristic algorithm was demonstrated by applying it to benchmark problems and by comparison with several other algorithms. Future research will cover the application of the proposed algorithm to other problems, such as network intrusion detection and network forensics, and the use of larger datasets than in this research, such as breast cancer diagnosis and traffic network datasets. An adaptive on-line inference engine for the interval type-2 fuzzy set will therefore be considered in future research on breast cancer diagnosis for medical training and testing.
REFERENCES
[1] J. M. Mendel, "Why we need type-2 fuzzy logic systems," May 2001, http://www.informit.com/articles/article.asp.
[2] J. M. Mendel and R. I. B. John, "Type-2 fuzzy sets made simple," IEEE Trans. Fuzzy Syst., vol. 10, pp. 117-127, April 2002.

[3] L. Zhao, "Adaptive interval type-2 fuzzy control based on gradient descent algorithm," in Intelligent Control and Information Processing (ICICIP), vol. 2, 2011, pp. 899-904.
[4] D. Hidalgo, P. Melin, O. Castillo, and G. Licea, "Optimization of interval type-2 fuzzy systems based on the level of uncertainty, applied to response integration in modular neural networks with multimodal biometry," in The 2010 International Joint Conference on Digital Object Identifier, 2010, pp. 1-6.
[5] A. Sangsongfa and P. Meesad, "Fuzzy rule base generation by a hybrid heuristic algorithm and application for classification," in National Conference on Computing and Information Technology, vol. 1, 2010, pp. 14-19.
[6] Iris Dataset, http://www.ailab.si/orange/doc/datasets/Iris.htm.
[7] Breast Cancer Dataset, http://www.breastcancer.org.
[8] J. Zeng and L. Wang, "A generalized model of particle swarm optimization," Pattern Recognition and Artificial Intelligence, vol. 18, pp. 685-688, 2005.
[9] H. Ishibuchi, T. Nakashima, and T. Murata, "Three-objective genetics-based machine learning for linguistic rule extraction," Information Sciences, vol. 136, pp. 109-133, 2001.
[10] R. Salman, V. Kecman, Q. Li, R. Strack, and E. Test, "Fast k-means algorithm clustering," Transactions on Machine Learning and Data Mining, vol. 3, p. 16, 2011.
[11] T. P. Hong and J. B. Chen, "Processing individual fuzzy attributes for fuzzy rule induction," in Fuzzy Sets and Systems, vol. 10, 2000, pp. 127-140.
[12] A. Chatterjee and A. Rakshit, "Influential rule search scheme (IRSS) - a new fuzzy pattern classifier," in IEEE Transactions on Knowledge and Data Engineering, vol. 16, 2004, pp. 881-893.
[13] L. Hongfei and P. Erxu, "A particle swarm optimization-aided fuzzy cloud classifier applied for plant numerical taxonomy based on attribute similarity," in Expert Systems with Applications, vol. 36, 2009, pp. 9388-9397.
[14] H. Song, S. Lee, D. Kim, and G. Park, "New methodology of computer aided diagnostic system on breast cancer," in Second International Symposium on Neural Networks, 2005, pp. 780-789.
[15] P. Meesad and G. Yen, "Combined numerical and linguistic knowledge representation and its application to medical diagnosis," in Component and Systems Diagnostics, Prognostics, and Health Management II, 2003.
[16] M. Ashraf, L. Kim, and X. Huang, "Information gain and adaptive neuro-fuzzy inference system for breast cancer diagnoses," in Computer Sciences and Convergence Information Technology (ICCIT), 2010, pp. 911-915.


Neural Network Modeling for an Intelligent Recommendation System Supporting SRM for Universities in Thailand
Kanokwan Kongsakun
School of Information Technology Murdoch University, South Street, Murdoch,WA 6150 AUSTRALIA kokoya002@yahoo.com

Jesada Kajornrit
School of Information Technology Murdoch University, South Street, Murdoch,WA 6150 AUSTRALIA J_kajornrit@hotmail.com

Chun Che Fung


School of Information Technology Murdoch University, South Street, Murdoch,WA 6150 AUSTRALIA l.fung@murdoch.edu.au

Abstract - In order to support their academic management processes, many universities in Thailand have developed innovative information systems and services with the aim of enhancing efficiency and the student relationship. Some of these initiatives are in the form of a Student Recommendation System supporting Student Relationship Management (SRM). However, the success or appropriateness of such a system depends on the expertise and knowledge of the counselor. This paper describes the development of a proposed Intelligent Recommendation System (IRS) framework and experimental results. The proposed system is based on an investigation of the possible correlations between students' historic records and final results. Neural Network techniques have been used with the aim of finding the structures and relationships within the data, and the final Grade Point Averages of freshmen in a number of courses are the subjects of interest. This information will help counselors recommend appropriate courses for students, thereby increasing their chances of success. Keywords - Intelligent Recommendation System; Student Relationship Management; data mining; neural network

I. INTRODUCTION

The growing complexity of technology in educational institutions creates opportunities for substantial improvements for management and information systems. Many designs and techniques have allowed for better results in analysis and recommendations. With this in mind, universities in Thailand are working hard to improve the quality of education and many institutes are focusing on how to increase the student retention rates and the number of completions. In addition, a universitys performance is also increasingly being used to measure its ranking and reputation [1]. One form of service which is normally provided by all universities is Student Counseling. Archer and Cooper [2] stated that the provision of counseling services is an important factor contributing to students academic success. In addition, Urata and Takano [3] stated that the essence of student counseling should include advices on career guidance, identification of learning strategies, handling of inter-personal relation, along with selfunderstanding of the mind and body. It can be said that a key aspect of student services is to provide course guidance as this

will assist the students in their course selection and future university experience. On the other hand, many students have chosen particular courses of study just because of perceived job opportunities, peer pressure and parental advice. Issues may arise if a student is not interested in the course, or if the course or career is not suitably matched with the students capability[4]. In Thailands tertiary education sector, teaching staff may have insufficient time to counsel the students due to high workload and there are inadequate tools to support them. Hence, it is desirable that some forms of intelligent recommendation tools could be developed to assist staff and students in the enrolment process. This forms the motivation of this research. One of the initiatives designed to help students and staff is the Student Recommendation System. Such system could be used to provide course advice and counseling for freshmen in order to achieve a better match between the students ability and success in course completion. In the case of Thai universities, this service is normally provided by counselors or advisors who have many years of experience within the organisation. However, with increasing number of students and expanded number of choices, the workload on the advisors is becoming too much to handle. It becomes apparent that some forms of intelligent system will be useful in assisting the advisors. In this paper, a proposed intelligent recommendation system is reported. This paper is structured as follows. Section 2 describes literature reviews of Student Relationship Management (SRM) in universities and issues faced by Thai university students. Section 3 describes Neural Network techniques which are used in the reported Intelligent Recommendation System, and Section 4 focuses on the proposed framework, which presents the main idea and the research methodology. Section 5 describes the experiments and the results. This paper then concludes with discussions on the work to be undertaken and future development.


II. LITERATURE REVIEW

A.

Student Relationship Management in Universities According to literature, the problem of low student retention in higher education could be attributed to low student satisfaction, student transfers and drop-outs [5]. This issue leads to a reduction in the number of enrolments and revenue, and increasing cost of replacement. On the other hand, it was found that the quality and convenience of support services are other factors that influence students to change educational institutes [6]. Consequently, the concept of SRM has been implemented in various universities so as to assist the improvement of the quality of learning processes and student activities. Definitions of SRM have been adopted from the established practices of Customer Relationship Management (CRM) which focuses on customers and are aimed to establish effective competition and new strategies in order to improve the performance of a firm [7]. In the case of SRM, the context is within the education sector. Although there have been many research focused on CRM, few research studies have concentrated on SRM. In addition, the technological supports are inadequate to sustain SRM in universities. For instance, a SRM systems architecture has been proposed so as to support the SRM concepts and techniques that assist the universitys Business Intelligent System [8]. This project provided a tool to aid the tertiary students in their decision-making process. The SRM strategy also provided the institution with SRM practices, including the planned activities to be developed for the students, as well other relevant participants. However, the study verified that the technological support to the SRM concepts and practices were insufficient at the time of writing [8]. In the context of educational institutes, the students may be considered having a role as customers, and the objective of Student Relationship Management is to increase their satisfaction and loyalty for the benefits of the institute. SRM may be defined under a similar view as CRM and aims at developing and maintaining a close relationship between the institute and the students by supporting the management processes and monitoring the students academic activities and behaviors. Piedade and Santos (2008) explained that SRM involves the identification of performance indicators and behavioral patterns that characterize the students and the different situations under which the students are supervised. In addition, the concept of SRM is understood as a process based on the student acquired knowledge, whose main purpose is to keep a close and effective students institution relationship through the closely monitoring of their academic activities along their academic path [9]. Hence, it can be said that SRM can be utilised as an important means to support and enhance a students satisfaction. Since understanding the needs of the students is essential for their satisfaction, it is necessary to prepare strategies in both teaching and related services to support Student Relationship Management. This paper therefore proposes an innovative information system to assist students in universities in order to support the SRM concept.

Issues Faced By Thai University Students Another study at Dhurakij Pundit University, Thailand looked at the relationship between learning behaviour and low academic achievement (below 2.0 GPA) of the first year students in the regular four-year undergraduate degree programs. The results indicated that students who had low academic achievement had a moderate score in every aspect of learning behaviour. On average, the students scored highest in class attendance, followed by the attempt to spend more time on study after obtaining low examination grades. Some of the problems and difficulties that mostly affected students low academic achievement were the students lack of understanding of the subject and lack of motivation and enthusiasm to learn [10]. Moreover, some other studies had focused on issues relating to students backgrounds prior to their enrolment, which may have effects on the progress of the students studies. For example, a research group from the Department of Education[11], Thailand studied the backgrounds of 289,007 Grade twelve students which may have affected their academic achievements. The study showed that the factors which could have effects on the academic achievement of the students may be attributed to personal information such as gender and interests, parental factors such as their jobs and qualifications, and information on the schools such as their sizes, types and ranking. Therefore, in the recruitment and enrolment of students in higher education, it is necessary to meet the students needs and to match their capability with the course of their choice. The students backgrounds may also have a part to play in the matching process. Understanding the students needs will implicitly enhance the students learning experience and increase their chances of success, and thereby reduce the wastage of resources due to dropouts, and change of programs. These factors are therefore taken into consideration in the proposed recommendation system in this study. III. NEURAL NETWORK BASED INTELLIGENT RECOMMENDATION SYSTEM TO SUPPORT SRM In term of education systems, Ackerman and Schibrowsky [12] have applied the concept of business relationships and proposed the business relationship marketing framework. The framework provided a different view on retention strategies and an economic justification on the need for implementing retention programs. The prominent result is the improvement of graduation rates by 65% by simply retaining one additional student out of every ten. The researcher added that this framework is appropriate both on the issues of places on quality of services. Although some problems could not be solved directly, it is recognized that Information and Communication Technologies (ICT) can be used and contributes towards maintaining a stronger relationship with students in the educational systems [8]. In this study, a new intelligent Recommendation System is proposed to support universities students in Thailand. This System is a hybrid system which is based by Neural Network


and Data Mining techniques; however, this paper only focuses on the Neural Network (NN) techniques. With respect to the Neural Network algorithm used in this study, the feed-forward neural network, also called the Multilayer Perceptron, was used. In the training of the Multilayer Perceptron, the back-propagation (BP) learning algorithm was used to perform the supervised learning process [13]. In the feed-forward calculations used in this experiment, the activations of the input neurons are set to the values of the encoded input fields. The activation of each neuron in a hidden or output layer is then calculated as follows:

b_i = σ( Σ_j ω_ij p_j )   (1)


where b_i is the activation of neuron i, j ranges over the set of neurons in the preceding layer, ω_ij is the weight of the connection between neuron i and neuron j, p_j is the output of neuron j, and σ(m) is the sigmoid (logistic) transfer function, given as follows:

σ(m) = 1 / (1 + e^(-m))   (2)
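A minimal sketch of the feed-forward calculation in equations (1) and (2) is shown below: each neuron's activation is the logistic function applied to the weighted sum of the outputs of the preceding layer. The layer sizes and weights are random placeholders, not the trained model reported in this paper.

```python
import numpy as np

def logsig(m):
    """Sigmoid (logistic) transfer function of equation (2)."""
    return 1.0 / (1.0 + np.exp(-m))

def forward(inputs, weights):
    """Feed-forward pass: b_i = logsig(sum_j w_ij * p_j) for each layer (eq. 1)."""
    activation = inputs
    for W in weights:                      # one weight matrix per layer
        activation = logsig(W @ activation)
    return activation

rng = np.random.default_rng(0)
weights = [rng.normal(size=(6, 9)),        # 9 encoded input fields -> 6 hidden neurons
           rng.normal(size=(1, 6))]        # 6 hidden neurons -> 1 output neuron
print(forward(rng.random(9), weights))
```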


The implementation of back propagation learning updates the network weights and biases in the direction in which the system performance increases most rapidly. This study used a feed-forward network architecture and the Mean Absolute Error (MAE) to define the accuracy of the models. IV. THE PROPOSED FRAMEWORK


Several solutions have been proposed to support SRM in the universities; however, not many systems in Thailand have focused on recommendation systems using historic records from graduated students. A recommendation system could apply statistical, artificial intelligence and data mining techniques by making appropriate recommendation for the students. Figure 1 illustrates the proposed recommendation system architecture. This proposal aims to analyse student background such as the high school where the student studied previously, school results and student performance in terms of GPAs from the universitys database. The result can then be used to match the profiles of the new students. In this way, the recommendation system is designed to provide suggestions on the most appropriate courses and subjects for the students, based on historical records from the universitys database. A. Data-Preprocessing Initially, data on the student records are collected from the university enterprise database. The data is then re-formatted in the stage of data transformation in order to prepare for processing by subsequent algorithms. In the data cleaning process, the parameters used in the data analysis are identified and the missing data are either eliminated or filled with null values [15]. Preparation of analytical variables is done in the data transformation step or being completed in a separate process. Integrity of the data is checked by validating the data


Figure 1. Proposed Hybrid Recommendation System Framework to Support Student Relationship Management

against the legitimate range of values and data types. Finally, the data is separated randomly into training and testing data for processing by the Neural Network. B. Data Analysis It can be seen in Fig. 1 that the Association rules, Decision Tree, Support Vector Machines and Neural Network are used to train the input data; however, this paper focuses on Neural Network which uses the feed-forward algorithm to classify the data and to establish the approximate function. The backpropagation algorithm is a multilayer network, it uses logsigmoid as the transfer function, logsig. In the training process, the backpropagation training functions in the feedforward networks is used to predict the output based on the input data.
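As a minimal sketch of the pre-processing stage just described (cleaning, range validation, and a random split into training and testing sets), the code below assumes hypothetical field names such as pre_gpa and uni_gpa standing in for the real database columns; the 70/30 split follows the experiment design reported later in the paper.

```python
import random

def preprocess(records, train_fraction=0.7, seed=0):
    """Clean records, validate value ranges, and split randomly into train/test sets.

    Each record is assumed to be a dict with 'pre_gpa' and 'uni_gpa' keys
    (hypothetical names, not the actual schema of the university database).
    """
    cleaned = []
    for rec in records:
        if rec.get("pre_gpa") is None or rec.get("uni_gpa") is None:
            continue                      # drop records with missing GPA values
        if not (0.0 <= rec["pre_gpa"] <= 4.0 and 0.0 <= rec["uni_gpa"] <= 4.0):
            continue                      # drop values outside the legitimate range
        cleaned.append(rec)

    rng = random.Random(seed)
    rng.shuffle(cleaned)
    cut = int(train_fraction * len(cleaned))
    return cleaned[:cut], cleaned[cut:]   # training set, testing set
```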


C. Intelligent Recommendation Model The Integrated Recommendation Model is composed of three parts: Course Recommendation for freshmen, Likelihood of GPA for students (years 1 to 4), and Subject Recommendation for students (year 1 to 4) respectively. Part A focuses on the course recommendation for freshmen and it is composed of two sections, which are the Overall GPA Recommendation, and the Course Ranking Recommendation respectively. In the section of Overall GPA, The output of this recommendation is in terms of an expected overall GPA. The outputs of Course Ranking Recommendation use the ranking of results in first section to indicate five appropriate courses The results of both parts can be used as suggestions to the freshmen during the enrolment process. Some example results from Part A are shown in this paper, and the input data of these 2 sections in the model are shown in Table 1. Another part of the framework focuses on Likelihood of GPA for students in each year. After the students selected the course to study and completed the enrolment process, the Likelihood of GPA for year 1 results can be used to monitor the performance of this group of students. The input data of this process is the same as the one shown in Table 1, with the addition of the GPA scores from the previous year. These are used as the extended features in the input to the neural network model. The result of the Recommendation is the GPA score of the year. In the same way, the system may be used to perform a Likelihood of GPA for Year 2 based on results from the first year. Similar approach can be adopted for the Likelihood of Year 3 and 4 results. Some example results of this part are shown in this paper. The final part of the recommendation model focuses on the subject recommendation for students in each year. This way also can help the counselor or students supervisor recommend student to enroll the subjects in each semester. To address the issue of imbalanced number of students in each course, the prediction model shown in Fig. 1 can be duplicated for different departments. The models computation is entirely data-driven and not based on subjective opinion, hence, the prediction models are unbiased and they will be used as an integral part of an Electronic Intelligent Recommendation System. D. Electronic Intelligent Recommendation System (e-IRS) It is planned that the new intelligent Recommendation Models will form an integral part of an online system for private universities in Thailand. The developed system will be evaluated by the university management and feedback from experienced counselors will be sought. The proposed system will also be available for use by new students who will access the online-application in their course selection during the enrolment process. As for the recommendation of the Year 2 and subsequent years results, this could be used by the counselors, staff, students supervisor and university management to provide supports for students who are likely to need help with their studies. This information will enable the university to better focus on the utilisation of their resources. In particular, this could be used to improve the retention rate

by providing additional supports to the group of students who may be at risk. V. EXPERIMENT DESIGN

The data preparation and selection process involves a dataset of 3,550 student records from five academic years. All the student data have included records from the first year to graduation. Due to privacy issue, the data in this study do not indicate any personal information, and no student is identified in the research. The student data has been randomised, and all private information has been removed. Example data from the dataset is shown below.
TABLE I. EXAMPLE OF TRAINING SAMPLE DATASET

Uni ID | Pre-GPA | Type of school | No. of Awards | Talent and Interest | Channels | Admission Round | Guardian Occupation | Gender | Uni GPA (target)
4800 | 2.35 | C | 0.2 | 1 | Poster | 1 | Police | F | 3.75
4801 | 3.55 | B | 0.3 | 4 | Brochure | 2 | Governor | M | 3.05
5001 | 2.55 | A | 0.9 | 3 | Friend | 5 | Teacher | F | 2.09
5002 | 2.75 | G | 0.4 | 5 | Family | 4 | Nurse | F | 2.58
5003 | 3.00 | F | 0.2 | 7 | Newspaper | 3 | Teacher | M | 2.77
5101 | 2.00 | E | 0.1 | 2 | others | 1 | Farmer | F | 2.11

(The columns other than Uni GPA are the input data from previous schooling; Uni GPA is the target.)
Table 1 shows the randomized student ID, GPA from previous study, the type of school, awards received, talent and interest, channels to know the university, admission round, Guardian Occupation, Gender and Overall GPA from university. Table 2 provides the definitions for the variables used in the above table.
TABLE II. DEFINITIONS OF VARIABLES

No. | Variables | Definition
1. | UniID | Randomized Student ID, which is not included in the clustering process; used only as an identification of different students
2. | GPA | Overall GPA results from previous study prior to admission to university
3. | Type of school | The school types are separated as follows - A: High School; B: Technical College; C: Commercial College; D: Open School; E: Sports, Thai Dancing, Religion or Handcraft Training Schools; F: Other Universities (change of universities or courses); G: Vocational Training Schools
4. | Number of Awards | Awards that students have received from previous study (normalized between 0.0 and 4.0; 0.0 = received no award, 4.0 = received the maximum number of awards in the dataset)

5. | Talent and Interest (in group number) | Talent and interest (1 = sports, 2 = music and entertainment, 3 = presentation, 4 = academic, 5 = others, 6 = involved in 2 to 3 talents and interests, 7 = involved in more than 3 talents and interests)
6. | Channels | The channel through which the student came to know the university, such as television or family
7. | Admission Round | Admission round of the university, which can be round 1 to 5
8. | Guardian Occupation | The occupation of the guardian, such as teacher or governor
9. | Gender | Female or Male
10. | Uni GPA | Overall GPA at the university, in the range 0 to 4


Figure 3. Comparison of MAE of testing data of sub-models for overall GPA and course ranking Recommendation

The testing was carried out in the final step of the experiment in each model, which used 30% of the available data. In Fig. 3, it is shown that the lowest value of Mean Absolute Error (MAE) is 0.069 based on data from the Department of Accounting. On the other hand, the highest value is 0.344. The average of MAE of all models is 0.142. The overall results obtained indicated reasonable prediction results were obtained.
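A minimal sketch of the Mean Absolute Error used to assess the models is shown below; the example GPA values are arbitrary illustrations, not results reported in the paper.

```python
def mean_absolute_error(actual, predicted):
    """MAE = average of |actual - predicted| over all test records."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Arbitrary example: predicted vs. actual overall GPAs for three students.
print(mean_absolute_error([2.35, 3.55, 2.75], [2.50, 3.40, 2.60]))
```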

Figure 2. Number of samples in each department

The student records were divided randomly into 70% training data and 30% testing data. The dataset includes both the qualitative and quantitative information shown in Tables 1 and 2. For training, this study used a two-layer feed-forward network architecture. Moreover, this study used the Mean Absolute Error (MAE) to define the accuracy of the models. VI. EXPERIMENTAL RESULTS
Figure 4. Comparison of MAE of testing data on the Likelihood of GPA in each Year

Based on MAE, the experimental results have shown that the Neural Network based models can be utilised to predict the GPA results of students with a good degree of accuracy.

Fig. 4 shows a comparison of the MAE of the results of the sub-models from each department in each year. It can be seen that the range of MAE values is lowest for the data from the Department of Education. On the other hand, the highest value is for the Department of Communication Arts, which is similar to the results for the overall GPA. The average MAE of all models is 0.393. The Department of Public Administration gives similar results for each year, while the Department of Communication Arts and the Department of Industrial Management give the most varied results between years, with MAE values that are much higher than the others in year 4 and year 2 respectively. It is possible that the difference in MAE is due to


the number of training and testing data. The overall results obtained have indicated reasonable recommendation results. VII. CONCLUSIONS This article describes a recommendation system in support of SRM and to address issues related to the problem of course advice or counseling for university students in Thailand. The recent work is focusing on the development and implementation of each process in the framework. The experiments have been based on Neural Network models and the accuracy of the recommendation model is reasonable. It is expected that the recommendation system will provide a useful service for the university management, course counselors, academic staff and students. The proposed system will also support Student Relationship Management strategies among the Thai private universities. REFERENCES
[1] R. Ackerman and J. Schibrowsky, "A Business Marketing Strategy Applied to Student Retention: A Higher Education Initiative," Journal of College Student Retention, vol. 9(3), pp. 330-336, 2007-2008.
[2] J. Archer Jr. and S. Cooper, "Counselling and Mental Health Services on Campus," in A Handbook of Contemporary Practices and Challenges, Jossey-Bass, San Francisco, CA, 1998.
[3] A.L. Caison, "Determinates of Systemic Retention: Implications for improving retention practice in higher education," Journal of College Student Retention, vol. 6, pp. 425-441, 2004-2005.
[4] K.L. Du and M.N.S. Swamy, Neural Networks in a Softcomputing Framework, Germany: Springer, vol. 1, 2006.
[5] Research group of the Department of Education, "A study of the background of grade twelve students affecting different academic achievements," Education Research, 2000.
[6] D.T. Gamage, J. Suwanabroma, T. Ueyama, S. Hada, and E. Sekikawa, "The impact of quality assurance measures on student services at the Japanese and Thai private universities," Quality Assurance in Education, vol. 16(2), pp. 181-198, 2008.
[7] Y. Gao and C. Zhang, "Research on Customer Relationship Management Application System of Manufacturing Enterprises," in Wireless Communications, Networking and Mobile Computing (WiCOM'08), 4th International Conference, Dalian, pp. 1-4, 2008.
[8] K. Harej and R.V. Horvat, "Customer Relationship Management Momentum for Business Improvement," Information Technology Interfaces (ITI), pp. 107-111, 2004.
[9] P. Helland, H.J. Stallings, and J.M. Braxton, "The fulfillment of expectations for college and student departure decisions," Journal of College Student Retention, vol. 3(4), pp. 381-396, 2001-2002.
[10] N. Jantarasapt, "The relationship between the study behavior and low academic achievement of students of Dhurakij Pundit University, Thailand," Dhurakij Pundit University, 2005.
[11] K. Jusoff, S.A.A. Samah, and P.M. Isa, "Promoting university community's creative citizenry," Proceedings of World Academy of Science, Engineering and Technology, vol. 33, pp. 1-6, 2008.
[12] M.B. Piedade and M.Y. Santos, "Student Relationship Management: Concept, Practice and Technological Support," IEEE Xplore, pp. 2-5, 2008.
[13] S. Subyam, "Causes of Dropout and Program Incompletion among Undergraduate Students from the Faculty of Engineering, King Mongkut's University of Technology North Bangkok," in The 8th National Conference on Engineering Education, Le Meridien Chiang Mai, Muang, Chiang Mai, Thailand, 2009.
[14] U. Uruta and A. Takano, "Between psychology and college of education," Journal of Educational Psychology, vol. 51, pp. 205-217, 2003.
[15] K.W. Wong, C.C. Fung, and T.D. Gedeon, "Data Mining Using Neural Fuzzy for Student Relationship Management," International Conference of Soft Computing and Intelligent Systems, Tsukuba, Japan, 2002.


Recommendation and Application of Fault Tolerance Patterns to Services

Tunyathorn Leelawatcharamas and Twittie Senivongse
Computer Science Program, Department of Computer Engineering Faculty of Engineering, Chulalongkorn University Bangkok, Thailand tunyathorn.l@student.chula.ac.th , twittie.s@chula.ac.th

AbstractService technology such as Web services has been one of the mainstream technologies in todays software development. Distributed services may suffer from communication problems or contain faults themselves, and hence service consumers may experience service interruption. A solution is to create services which can tolerate faults so that failures can be made transparent to the consumers. Since there are many patterns of software fault tolerance available, we end up with a question of which pattern should be applied to a particular service. This paper attempts to recommend to service developers the patterns for fault tolerant services. A recommendation model is proposed based on characteristics of the service itself and of the service provision environment. Once a fault tolerance pattern is chosen, a fault tolerant version of the service can be created as a WS-BPEL service. A software tool is developed to assist in pattern recommendation and generation of the fault tolerant service version.
Keywords - fault tolerance patterns; Web services; WS-BPEL

responses. (2) System and network faults are those that can be identified, for example, through HTTP status code and detected by execution environment, e.g., communication timeout, server error, service unavailable. (3) SLA faults are raised when services violate SLAs, e.g., response time requirements, even though functional requirements are fulfilled. For service providers, one of the main goals of service provision is service reliability. Services should be provided in a reliable execution environment and prepared for various faults so that failures can be made as transparent as possible to service consumers. Service designers should therefore design services with a fault tolerance mindset, expecting the unexpected and preparing to prevent and handle potential failures. There are many fault tolerance patterns or exception handling strategies that can be applied to make software and systems more reliable. Common patterns involve how to handle or recover from failures, such as communication retry or the use of redundant system nodes. In a distributed services context, we end up with a question of which fault tolerance pattern should be applied to a particular service. We argue that not all patterns are equally appropriate for any services. This is due to the characteristics of each service including service semantics and the environment of service provision. In this paper, we propose a mathematical model that can assist service designers in designing fault tolerant versions of services. The model helps recommend which fault tolerance patterns are suitable for particular services. With a supporting tool, service designers can choose a recommended pattern and have fault tolerant versions of the services generated as WS-BPEL services. Section II discusses related work in Web services fault tolerance. Section III lists fault tolerance patterns that are considered in our work. Characteristics of the services and condition of service provision that we use as criteria for pattern recommendation are given in Section IV. Section V presents how service designers can be assisted by the pattern recommendation model. The paper concludes in Section VI with future outlook. II. RELATED WORK

I. INTRODUCTION

Service technology has been one of the mainstream technologies in todays software development since it enables rapid flexible development and integration of software systems. The current Web services technology builds software upon basic building blocks called Web services. They are software units that provide certain functionalities over the Web and involve a set of interface and protocol standards, e.g. Web Service Definition Language (WSDL) for describing service interfaces, SOAP as a messaging protocol, and Business Process Execution Language (WS-BPEL) for describing business processes of collaborating services [1]. Like other software, services may suffer from communication problems or contain faults themselves, and hence service consumers may experience service interruption. Different types of faults have been classified for services [2], [3], [4], and can be viewed roughly in three categories: (1) Logic faults comprise calculation faults, data content faults, and other logic-related faults thrown specifically by the service. Web service consumers can detect logic faults by WSDL fault messages or have a way to check correctness of service

A number of research efforts in the area of fault-tolerant services address the application of fault tolerance patterns to WS-BPEL processes, even though they may use different fault tolerance terminology for similar patterns or


strategies. For example, Dobsons work [5] is among the first in this area which proposes how to use BPEL language constructs to implement fault tolerant service invocation using four different patterns, i.e., retry, retry on a backup, and parallel invocations to different backups with voting on all responses or taking the first response. Lau et al. [6] use BPEL to specify passive and active replication of services in a business process and also support a backup of BPEL engine itself. Liu et al. [2] propose a service framework which combines exception handling and transaction techniques to improve reliability of composite services. Service designers can specify exception handling logic for a particular service invocation as an EventCondition-Action rule, and eight strategies are supported, i.e., ignore, notify, skip, retry, retryUntil, alternate, replicate, and wait. Thaisongsuwan and Senivongse [7] define the implementation of fault tolerance patterns, as classified by Hanmer [8], on BPEL processes. Nine of the architectural, detection, and recovery patterns are addressed, i.e., Units of Mitigation, Quarantine, Error Handler, Redundancy, Recovery Block, Limit Retries, Escalation, Roll-Forward, and Voting. These researches suggest that different patterns can be applied to different service invocations as appropriate but are not specific on when to apply which. Nevertheless we adopt their BPEL implementations of the patterns for the generation of our fault tolerant services. Zheng and Lyu present interesting approaches to fault tolerant Web services which support strategies including retry, recovery block, N-version programming (i.e., parallel service invocations with voting on all responses), and active (i.e., parallel service invocations with taking the first response). For composite services, they propose a QoS model for fault tolerant service composition which helps determine which combination of the fault tolerance strategies gives a composite service the optimal quality [9]. In the context of individual Web services, they propose a dynamic fault tolerance strategy selection for a service [3]; the optimal strategy is one that gives optimal service roundtrip time and failure rate. Both user-defined service constraints and current QoS information of the service are considered in the selection algorithm. In [10], they view fault tolerance strategies as time-redundancy and spaceredundancy (i.e., passive and active replication) as well as combination of those strategies. Although their approaches and ours share the same motivation, their fault tolerance strategy selection requires an architecture that supports service QoS monitoring and provision of replica services. This could be too much to afford for strategy selection, for example, if it turns out that expensive strategies involving replica nodes are not appropriate. This paper can be complementary to their approach but it is more lightweight by merely recommending which fault tolerance strategies are likely to match service characteristics that are of concern to service designers. III. FAULT TOLERANCE PATTERNS

Figure 1. Fault tolerance patterns: (1) Retry, (2) Wait, (3) RB Replica, (4) RB NVP, (5) Active Replica, (6) Active NVP, (7) Voting Replica, (8) Voting NVP, (9) Retry + Wait.


1) Retry: When service invocation is not successful, invocation of the same service is repeated until it succeeds or a condition is evaluated to true. A common condition is the allowed number of retries.

2) Wait: Service invocation is delayed until a specified time. If the service is expected to be busy or unavailable at a particular time, delaying invocation until a later time could help decrease failure probability.

3) RecoveryBlockReplica: When service invocation is not successful, invocation is made sequentially to a number of functionally equivalent alternatives (i.e., recovery blocks) until the invocation succeeds or all alternatives are used. Here the alternatives are replicas of the original service; they can be different copies of the original service but are provided in different execution environments.

4) RecoveryBlockNVP: This pattern is similar to 3) but adopts N-version programming (NVP). Here the original service and its alternatives are developed by different development teams or with different technologies, algorithms, or programming languages, and they may be provided in the same or different execution environments. This would be more reliable than having replicas of the original service as alternatives since it can decrease the failure probability caused by faults in the original service.

5) ActiveReplica: To increase the probability that service invocation will return in a timely manner, invocation is made to a group of functionally equivalent services in parallel. The first successful response from any service is taken as the invocation result. Here the group members are replicas of each other; they can be different copies of the same service but are provided in different execution environments.

6) ActiveNVP: This pattern is similar to 5) but adopts NVP. Here the services in the group are developed by different development teams or with different technologies, algorithms, or programming languages, and they may be provided in the same or different execution environments. This would be more reliable than having the group as replicas of each other since it can decrease the failure probability caused by faults in the replicas.

7) VotingReplica: To increase the probability that service invocation will return a correct result despite service faults, invocation is made to a group of functionally equivalent services in parallel. Given that there will be several responses from the group, a voting algorithm, e.g., majority voting, is used to determine the final result of the invocation. Here the group members are replicas of each other; they can be different copies of the same service but are provided in different execution environments.

8) VotingNVP: This pattern is similar to 7) but adopts NVP. Here the services in the group are developed by different development teams or with different technologies, algorithms, or programming languages, but they may be provided in the same or different execution environments.

9) Retry + Wait: This pattern is an example of a possible combination of different patterns. When service invocation is not successful, invocation is retried a number of times and, if still unsuccessful, the invoker waits until a specified time before another invocation is made.
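To make the behaviour of the patterns concrete, the following is a minimal sketch in Python rather than in the target BPEL; the invoke callables and the ServiceFault exception are hypothetical stand-ins for Web service operations and their faults. It illustrates the time redundancy of Retry + Wait (pattern 9) and the space redundancy of Active invocation (patterns 5 and 6).

    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    class ServiceFault(Exception):
        """Raised when a service invocation fails."""

    def retry_then_wait(invoke, request, max_retries=3, wait_until=None):
        """Pattern 9 (Retry + Wait): retry up to max_retries times; if still
        unsuccessful, delay until wait_until (a UNIX timestamp) and try once more."""
        for _ in range(max_retries):
            try:
                return invoke(request)
            except ServiceFault:
                continue                                  # transient failure: retry
        if wait_until is not None:
            time.sleep(max(0.0, wait_until - time.time()))
            return invoke(request)
        raise ServiceFault("invocation failed after retries")

    def active_invoke(replicas, request):
        """Patterns 5/6 (Active): invoke all functionally equivalent services in
        parallel and take the first successful response as the result."""
        with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            futures = [pool.submit(invoke, request) for invoke in replicas]
            for future in as_completed(futures):
                if future.exception() is None:
                    return future.result()
        raise ServiceFault("all replicas failed")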
All patterns except Wait employ redundancy. Retry is a form of time redundancy, taking extra communication time to tolerate faults, whereas RecoveryBlock, Active, and Voting employ space redundancy, using extra resources to mask faults [10]. RecoveryBlock uses the passive replication technique: invocation is made to the original (primary) service first, and alternatives (backup services) will be invoked only if the original service or other alternatives fail. Active and Voting both use the active replication technique: all services in a group execute a service request simultaneously, but the two patterns determine the final result differently. Retry, Wait, and RecoveryBlock can help tolerate system and network faults. Voting can be used to mask logic faults, e.g., when majority voting is used and the majority of service responses are correct; it can even detect logic faults if a correct response is known. Active can help with SLA faults that relate to late service responses.

IV. SERVICE CHARACTERISTICS

The following are the criteria regarding service characteristics and the condition of the service execution environment which the service designer/provider will consider for a particular service. These characteristics will influence the recommendation of fault tolerance patterns for the service.

1) Transient Failure: The service environment is generally reliable and potential failures would only be transient. For example, the service may be inaccessible at times due to network problems, but a retry or an invocation after a wait should be successful.

2) Instance Specificity: The service is specific and consumers are tied to this particular service. It can be that there are no equivalent services provided by other providers, or that the service maintains specific data of the consumers. For example, a CheckBalance service of a bank is specific because a customer can only check an account balance through the service of the bank with which he/she has an account, and not through the services of other banks.

3) Replica Provision: This relates to the ability of the service designer/provider to accommodate different replicas of the service. The replicas should be provided in different execution environments, e.g., on different machines or processing different copies of data. This ability helps improve reliability since service provision does not rely on a single service.

4) NVP Provision: This relates to the ability of the service designer/provider to accommodate different versions of the service. The service versions may be developed by different development teams or with different technologies, algorithms, or programming languages, and they may be provided in the same or different execution environments. This ability helps improve reliability since service provision does not rely on any single version of the service.

5) Correctness: The service designer expects that the service and execution environment should be managed to provide correct results. This relates to the quality of the service environment in providing reliable communication, including mechanisms to check the correctness of messages even in the presence of logic faults.

6) Timeliness: The service designer expects that the service and execution environment should be managed to react quickly to requests and give timely results.

7) Simplicity: The service designer/provider may be concerned with the simplicity of the service. Provision for fault tolerance can complicate service logic, add more interactions to the service, and increase the latency of service access. When service provision is more complex, more faults can be introduced.

8) Economy: The service designer/provider may be concerned with the economy of making the service fault tolerant. Fault tolerance patterns consume extra time, cost, and computing resources. For example, sequential invocation is cheaper than parallel invocation of a group of services, and providing replicas of the service is cheaper than NVP.

V. FAULT TOLERANCE PATTERNS RECOMMENDATION

TABLE I. RELATIONSHIP BETWEEN SERVICE CHARACTERISTICS AND FAULT TOLERANCE PATTERNS
(rows: service characteristics; columns: fault tolerance patterns; RB = RecoveryBlock, Act = Active, Vot = Voting, Rep = Replica)

Characteristic              Retry  Wait  RB-Rep  RB-NVP  Act-Rep  Act-NVP  Vot-Rep  Vot-NVP  Retry+Wait
Transient Failure (TF)        8      7     0       0       0        0        0        0        7.5
Instance Specificity (IS)     8      8     7       6       5        4        5        4        8
Replica Provision (RP)        0      0     8       0       8        0        8        0        0
NVP Provision (NP)            0      0     0       8       0        8        0        8        0
Correctness (CO)              2      2     3       4       5        6        7        8        2
Timeliness (TI)               4      1     5       6       7        8        2        3        2.5
Simplicity (SI)               8      8     7       6       5        4        3        2        8
Economy (EC)                  7      8     6       5       4        3        2        1        7.5

The recommendation of fault tolerance patterns for a service is based on what characteristics the service possesses and which patterns suit such characteristics.

A. Service Characteristics-Fault Tolerance Patterns Relationship

We first define a relationship between service characteristics and fault tolerance patterns as in Table I. Each cell of the table represents the relationship level, i.e., how well the pattern can respond to the service characteristic. The relationship level ranges from 0 to 8 since there are eight basic patterns. Level 8 means the pattern responds very well to the characteristic, level 7 responds well, and so on. Level 0 means there is no relationship between the pattern and the service characteristic. For example, for Economy, Retry and Wait are cheaper than the other patterns that employ space redundancy since both of them require only one service implementation. But Wait responds best to economy (i.e., level 8) since there is only a single call to the service, whereas Retry involves multiple invocations (i.e., level 7). Sequential invocation in RecoveryBlock is cheaper than parallel invocation in Active and Voting because not all service implementations have to be invoked; a particular alternative of the service will be invoked only if the original service and other alternatives fail, whereas parallel invocation requires that different service implementations be invoked simultaneously. RecoveryBlockReplica (level 6) is cheaper than RecoveryBlockNVP (level 5) because providing replicas of the service should cost less than developing N versions. Similarly, ActiveReplica (level 4) is cheaper than ActiveNVP (level 3), and VotingReplica (level 2) is cheaper than VotingNVP (level 1). Note that Voting is more expensive than Active due to the development of a voting algorithm to determine the final result. For a combination of patterns such as Retry+Wait, the relationship level is the average of the levels of the combined patterns.

For the relationship between the other characteristics and the patterns, we reason in a similar manner. Retry and Wait suit an environment with Transient Failure. The patterns that rely on the execution of a single service at a time respond better to Instance Specificity than those that employ multiple service implementations. Replica Provision and NVP Provision are relevant to the patterns that employ space redundancy. For Correctness, Voting is the best since it is the only pattern that can mask/detect Byzantine failure (i.e., the case where the services give incorrect results). Active is better than RecoveryBlock with regard to Byzantine failure because the chance of getting an incorrect result should be lower than in the case of RecoveryBlock, since the result of Active can come from any one of the redundant services invoked in parallel. Retry and Wait do not suit Correctness since they rely on the execution of a single service. For Timeliness, the comparison of the patterns on time performance given in [2], [3] (ranked in descending order) is: Active, RecoveryBlock, Retry, Voting, Wait. For Simplicity, the logic of Retry and Wait, which involves a single service, is the simplest.

B. Assessment of Service Characteristics

The next step is to have the service designer assess what characteristics the service possesses; these characteristics will influence pattern recommendation.

1) Identify Dominant Characteristics: The service designer will consider service semantics and the condition of service provision, and identify dominant characteristics that should influence pattern recommendation. For each characteristic that is of concern, the service designer defines a dominance level. Level 1 means the characteristic is the most dominant (i.e., ranked 1st), level 2 means less dominant (i.e., ranked 2nd), and so on. Level 0 means the service does not have the characteristic or the characteristic is of no concern. For example, during the design of a CheckBalance service of a bank, the service designer considers Instance Specificity as the most dominant characteristic (i.e., dominance level 1) since



bank customers would be tied to their bank accounts, which are associated with this particular service. From experience, the designer sees that the computing environment of the bank provides a reliable service and, if there is a problem, it is generally transient, and hence a simple fault handling strategy is preferred (i.e., Transient Failure and Simplicity have dominance level 2). Nevertheless, the designer is able to afford exact replicas of the service if something more serious happens (i.e., Replica Provision has dominance level 3). Suppose the designer is not concerned with the other characteristics; then the others would have dominance level 0. Table II shows the dominance levels of all characteristics of this CheckBalance service.

2) Convert Dominance Level to Dominance Weight:

a) Convert Dominance Level to Raw Score: The dominance level of each characteristic will be converted to a raw score. The most dominant characteristic gets the highest score, which is equal to the dominance level of the least dominant characteristic that is considered. Less dominant characteristics get lower scores accordingly. From the example of the CheckBalance service, Replica Provision has the least dominance level of 3, so the raw score of the most dominant characteristic, Instance Specificity, is 3. Then the score for Transient Failure and Simplicity would be 2, and Replica Provision gets 1. Table III shows the raw scores of the service characteristics.

b) Compute Dominance Weight: First, divide 1 by the summation of the raw scores. For example, for the CheckBalance service, the summation of the raw scores in Table III is 8 (2+3+1+0+0+0+2+0) and the quotient would be 1/8 (0.125). Then, multiply this quotient by the raw score of each characteristic. The results are the dominance weights of the characteristics (where the summation of the weights is 1). The weights will be used later in the recommendation model. For the CheckBalance service, the dominance weights of all characteristics are shown in Table IV.

C. Fault Tolerance Patterns Recommendation Model

We propose a model for fault tolerance patterns recommendation as in (1).
TABLE II. DOMINANCE LEVELS OF SERVICE CHARACTERISTICS (CheckBalance service; abbreviations as in Table I)

Characteristic:   TF  IS  RP  NP  CO  TI  SI  EC
Dominance level:   2   1   3   0   0   0   2   0

TABLE IV. DOMINANCE WEIGHTS OF SERVICE CHARACTERISTICS (CheckBalance service)

Characteristic:    TF    IS     RP     NP  CO  TI  SI    EC
Dominance weight:  0.25  0.375  0.125  0   0   0   0.25  0

P = D x R                                                            (1)

where

P = a vector of fault tolerance pattern scores,
D = a vector of dominance weights of service characteristics as computed in Section V.B,
R = a relationship matrix between service characteristics and fault tolerance patterns as proposed in Section V.A.

Therefore, taking R as the matrix of Table I, with rows ordered TF, IS, RP, NP, CO, TI, SI, EC and columns ordered Retry, Wait, RBReplica, RBNVP, ActiveReplica, ActiveNVP, VotingReplica, VotingNVP, Retry+Wait:

    R = [ 8  7  0  0  0  0  0  0  7.5
          8  8  7  6  5  4  5  4  8
          0  0  8  0  8  0  8  0  0
          0  0  0  8  0  8  0  8  0
          2  2  3  4  5  6  7  8  2
          4  1  5  6  7  8  2  3  2.5
          8  8  7  6  5  4  3  2  8
          7  8  6  5  4  3  2  1  7.5 ]

and, in the case of the CheckBalance service, D (over TF, IS, RP, NP, CO, TI, SI, EC) is

    D = [ 0.25  0.375  0.125  0  0  0  0.25  0 ].

The pattern recommendation P would then be

    P = [ 7.00  6.75  5.38  3.75  4.12  2.50  3.62  2.00  6.88 ]

over Retry, Wait, RBReplica, RBNVP, ActiveReplica, ActiveNVP, VotingReplica, VotingNVP, and Retry+Wait, respectively.
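As a worked check of (1), the following short Python sketch (our own illustration, not part of the supporting tool; the helper names are hypothetical) converts the CheckBalance dominance levels of Section V.B into weights and multiplies them by the matrix R above, reproducing the published scores.

    # Characteristics and patterns in the order used in Table I and matrix R.
    CHARS = ["TF", "IS", "RP", "NP", "CO", "TI", "SI", "EC"]
    PATTERNS = ["Retry", "Wait", "RBReplica", "RBNVP", "ActiveReplica",
                "ActiveNVP", "VotingReplica", "VotingNVP", "Retry+Wait"]

    # Relationship matrix R (rows: characteristics, columns: patterns), from Table I.
    R = [
        [8, 7, 0, 0, 0, 0, 0, 0, 7.5],   # TF
        [8, 8, 7, 6, 5, 4, 5, 4, 8  ],   # IS
        [0, 0, 8, 0, 8, 0, 8, 0, 0  ],   # RP
        [0, 0, 0, 8, 0, 8, 0, 8, 0  ],   # NP
        [2, 2, 3, 4, 5, 6, 7, 8, 2  ],   # CO
        [4, 1, 5, 6, 7, 8, 2, 3, 2.5],   # TI
        [8, 8, 7, 6, 5, 4, 3, 2, 8  ],   # SI
        [7, 8, 6, 5, 4, 3, 2, 1, 7.5],   # EC
    ]

    def dominance_weights(levels):
        """Convert dominance levels (1 = most dominant, 0 = not considered)
        into raw scores and then into weights that sum to 1 (Section V.B)."""
        max_level = max(levels.values())                 # least dominant level considered
        raw = {c: (max_level - lvl + 1 if lvl else 0) for c, lvl in levels.items()}
        total = sum(raw.values())
        return [raw.get(c, 0) / total for c in CHARS]

    def recommend(levels):
        """Compute P = D x R and return the patterns ranked by score."""
        D = dominance_weights(levels)
        P = [sum(D[i] * R[i][j] for i in range(len(CHARS))) for j in range(len(PATTERNS))]
        return sorted(zip(PATTERNS, P), key=lambda x: -x[1])

    # CheckBalance example: IS most dominant, TF and SI next, RP last.
    levels = {"TF": 2, "IS": 1, "RP": 3, "NP": 0, "CO": 0, "TI": 0, "SI": 2, "EC": 0}
    for pattern, score in recommend(levels):
        print(f"{pattern:14s} {score:.2f}")
    # Output: Retry 7.00, Retry+Wait 6.88, Wait 6.75, RBReplica 5.38, ActiveReplica 4.12, ...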

TABLE III. RAW SCORES OF SERVICE CHARACTERISTICS (CheckBalance service)

Characteristic:  TF  IS  RP  NP  CO  TI  SI  EC
Raw score:        2   3   1   0   0   0   2   0

The recommendation says how well each pattern suits the service according to the characteristic assessment. The pattern with the highest score would be best suited to the service. Since the designer of the CheckBalance service pays most attention to Instance Specificity, Transient Failure, and Simplicity, the designer is inclined to rely on reliable provision of a single service. The patterns that respond well to these characteristics, i.e., Retry, Wait, and Retry+Wait, are among the first to be recommended. Here, Retry is the best-suited pattern, with the highest score. Since the designer can also provide replica services but still has simplicity in mind, RecoveryBlockReplica is the next to be recommended. Voting



patterns and those which require NVP services are more complex strategies, so they get lower scores.

D. Generation of Fault Tolerant Service

A software tool has been developed to support fault tolerance pattern recommendation and the generation of fault tolerant services as BPEL services. The service designer will first be prompted to select the service characteristics that are of interest, and then specify a dominance level for each chosen characteristic. The tool will calculate and rank the pattern scores as shown in Fig. 2 for the CheckBalance service. The designer can choose one of the recommended patterns, and the tool will prompt the designer to specify the WSDL of the service together with any parameters necessary for the generation of the BPEL version. For Retry, the parameter is the number of retries. For RecoveryBlock, Active, and Voting, the parameter is the set of WSDLs of all service implementations involved. For Wait, the parameter is the wait-until time. In this example, Retry is chosen and the number of retries is 5. Then a fault tolerant version of the service is generated as a BPEL service for GlassFish ESB v2.2, as shown in Fig. 3. The BPEL version invokes the service in a fault tolerant way, implementing the pattern structure we adopt from [2], [7].

VI. CONCLUSION

In this paper, we propose a model to recommend fault tolerance patterns for services. The recommendation considers service characteristics and the condition of the service environment. A supporting tool has been developed to assist in the recommendation and in the generation of fault tolerant service versions as BPEL services. As mentioned earlier, it is a lightweight approach which helps to identify fault tolerance patterns that are likely to match service characteristics according to the subjective assessment of service designers. At present the recommendation is aimed at a single service. The approach can be extended to accommodate pattern recommendation and the generation of fault tolerant composite services. More combinations of patterns can also be supported. In addition, we are in the process of trying the model with services in business organizations for further evaluation.

REFERENCES
[1] M. P. Papazoglou, Web Services: Principles and Technology. Pearson Education Prentice Hall, 2008.
[2] A. Liu, Q. Li, L. Huang, and M. Xiao, "FACTS: A framework for fault tolerant composition of transactional Web services," IEEE Trans. on Services Computing, vol. 3, no. 1, 2010, pp. 46-59.
[3] Z. Zheng and M. R. Lyu, "An adaptive QoS-aware fault tolerance strategy for Web services," Empirical Software Engineering, vol. 15, issue 4, 2010, pp. 323-345.
[4] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Trans. on Dependable and Secure Computing, vol. 1, no. 1, 2004, pp. 11-33.
[5] G. Dobson, "Using WS-BPEL to implement software fault tolerance for Web services," in Procs. of 32nd EUROMICRO Conf. on Software Engineering and Advanced Applications (EUROMICRO-SEAA'06), 2006, pp. 126-133.
[6] J. Lau, L. C. Lung, J. D. S. Fraga, and G. S. Veronese, "Designing fault tolerant Web services using BPEL," in Procs. of 7th IEEE/ACIS Int. Conf. on Computer and Information Science (ICIS 2008), 2008, pp. 618-623.
[7] T. Thaisongsuwan and T. Senivongse, "Applying software fault tolerance patterns to WS-BPEL processes," in Procs. of Int. Joint Conf. on Computer Science and Software Engineering (JCSSE 2011), 2011, pp. 269-274.
[8] R. Hanmer, Patterns for Fault Tolerant Software. Chichester: Wiley Publishing, 2007.
[9] Z. Zheng and M. R. Lyu, "A QoS-aware fault tolerant middleware for dependable service composition," in Procs. of IEEE Int. Conf. on Dependable Systems & Networks (DSN 2009), 2009, pp. 239-249.
[10] Z. Zheng and M. R. Lyu, "Optimal fault tolerance strategy selection for Web services," Int. J. of Web Services Research, vol. 7, issue 4, 2010, pp. 21-40.

Figure 2. Pattern recommendation by supporting tool.

Figure 3. BPEL structure for Retry.



Development of Experience Base Ontology to Increase Competency of Semi-automated ICD-10-TM Coding System
Wansa Paoin
Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand, wansa@tu.ac.th

Supot Nitsuwat
Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand, sns@kmutnb.ac.th

Abstract - The objectives of this research were to create the International Classification of Diseases, 10th edition, Thai Modification (ICD-10-TM) experience base ontology, to test the usability of the ICD-10-TM experience base together with a knowledge base in a semi-automated ICD coding system, and to increase the competency of the system. The ICD-10-TM experience base ontology was created by collecting 4,880 anonymous patient records coded into ICD codes by 32 volunteer expert coders working in different hospitals. Data were checked for misspellings and mismatched elements and converted into an experience base ontology using the n-triple (N3) format of the Resource Description Framework. The semi-automated coding software could search the experience base when an initial search of the ICD knowledge base yielded no result. The competency of the semi-automated coding system was tested using another data set containing 14,982 diagnoses from 5,000 medical records of anonymous patients. All ICD codes produced by the semi-automated coding system were checked against the correct ICD codes validated by ICD expert coders. When the system used only the ICD knowledge base for automated coding, it could find 7,142 ICD codes (47.67%), recall = 0.477, precision = 0.909, but when it used the ICD knowledge base with the experience base search, it could find 9,283 ICD codes (61.96%), recall = 0.677, precision = 0.928. This increased ability of the system was statistically significant (paired T-test p-value = 0.008, < 0.05). This research demonstrated a novel mechanism of using an experience base ontology to enhance the competency of a semi-automated ICD coding system. The model of interaction between knowledge base and experience base developed in this work could also be used as basic knowledge for the development of other computer systems that compute intelligent answers to complex questions.

Keywords - experience base, knowledge base ontology, semi-automated ICD coding system

I. INTRODUCTION

Ontology is a data structure, a data representation tool to share and reuse knowledge between artificial intelligence systems which share a common vocabulary. An ontology could be used as a knowledge base for a computer system to compute intelligent answers to complex questions like ICD-10-TM (The International Classification of Diseases and Related Health Problems, 10th Revision, Thai Modification) [1] coding.

ICD-10 is a classification that was created and has been maintained by the World Health Organization (WHO) since 1992 [2]. Electronic versions of ICD-10 were released in 2004 as browsing software in a CD-ROM package [3] and as ICD-10 online on the WHO website [4]. Both electronic versions provided only a simple word search service that facilitates only a minor part of the complex ICD coding process. Since 2000, some countries have added more codes from medical expert opinions into ICD-10, so ICD-10 has been modified in some countries, e.g., Australia, Canada, and Germany. In Thailand, ICD-10 has been modified into ICD-10-TM (Thai Modification) since 2000 [5] and is maintained by the Ministry of Public Health, Thailand.

ICD coding is an important task for every hospital. After a medical doctor completes treatment for a patient, the doctor must summarize all diagnoses of the patient in a diagnosis and procedure summary form. Then a clinical coder will start ICD coding for that case using the manual ICD coding process, which uses two ICD books as reference sources. All ICD codes for each patient will be used for morbidity and mortality statistical analysis and for reimbursement of medical care costs in the hospital.

The manual ICD coding process is complex. ICD coding cannot be finished merely by word matching between diagnosis words and a list of ICD codes/labels; a clinical coder may assign two different ICD codes to two patients with the same diagnosis word based on each patient's context. Unfortunately, this complexity of ICD coding was not recognized by most researchers who tried to develop semi-automated and automated ICD coding systems in the past.

Several research works mentioned automated ICD coding processes. The Diogene 2 program [6] built a medical terminology table and used it to map diagnosis words into a morphosemantem (word-form) layer, then converted the terms into a concept layer before matching them to labels of ICD codes in an expression layer. Heja et al. [7] matched diagnosis words with a list of ICD code labels and suggested that a hybrid model yields better matching results. Pakhomov et al. [8]



designed an automated coding system to assign codes to outpatient diagnoses using example-based and machine learning techniques. Pereira et al. [9] built a semi-automated coding help system using an automated MeSH-based indexing system and a mapping between MeSH and ICD-10 extracted from the UMLS metathesaurus. These previous works used only word matching approaches and never covered the full standard ICD coding process, which has been summarized in ICD-10 volume 2 [10].

In our previous work [11] we created an ICD-10-TM ontology as a knowledge base for the development of semi-automated ICD coding. The ICD-10-TM ontology contains two main knowledge bases, i.e., a tabular list knowledge base and an index knowledge base, with 309,985 concepts and 162,092 relations. The tabular list knowledge base can be divided into an upper level ontology, which defines the hierarchical relationships between the 22 ICD chapters, and a lower level ontology, which defines the relations between chapters, blocks, categories, rubrics, and basic elements (include, exclude, synonym, etc.) of the ICD tabular list. The index knowledge base describes the relations between keywords and modifiers in the general format and the table format of the ICD index. The ICD-10-TM ontology was implemented in semi-automated ICD-10-TM coding software as a knowledge base. The software is distributed by the Thai Health Coding Center, Ministry of Public Health, Thailand [12]. The coding algorithm searches for matching keywords and modifiers from the index ontology and the diagnosis knowledge base, then verifies the code definition and the include and exclude conditions from the tabular list ontology. The program displays all ICD-10-TM codes found or not found to the clinical coder, and the human coder can then accept the codes or change them to other codes based on her judgment and standard coding guidelines. A user survey revealed good results from the ontology search, with high user satisfaction (>95%) regarding the usability of the ontology.

When we tried to use the system to do automated coding, i.e., to code all diagnoses before a clinical coder starts coding in order to reduce the number of diagnoses to be coded by the clinical coder, we found that automated coding based on the ICD-10-TM ontology could successfully code 24-50% of all diagnosis words. To increase the competency of the system, we created another ontology, called an experience base, to help the system code more diagnosis words than previously possible.

In this paper, we present the ICD-10-TM experience base and the application of a novel mechanism using an experience base ontology to enhance the competency of a semi-automated ICD coding system. The model of interaction between knowledge base and experience base developed in this work could also be used as basic knowledge for the development of other computer systems that compute intelligent answers to complex questions.

II. METHODOLOGY

To create the experience base, we asked expert coders in Thailand to volunteer to participate in this project. To be able to participate, an expert coder must have had at least 10 years of experience in ICD coding, or have passed the examination for certified coder (intermediate level) of the Thai Health Coding Center, Ministry of Public Health. The project committee selected 42 expert coders from 198 volunteers based on their ability to devote time to the project, hospital size, the location of the hospital where the coders work, and competency in using computers and software.

All selected expert coders attended a one-day training on how to use the semi-automated coding system. Each of them was assigned to use the system to do ICD coding. They used medical records of patients admitted to their hospitals during January to November 2011 as input to the system. The input data did not include patient identification data. Only the sex, age, and obstetric condition of each patient had to be input into the system, since these data elements, as well as all diagnosis words, are essential for ICD code selection by the system. Each expert coder had to input at least 100 different cases into the system within 30 days. After finishing the task, each coder sent the saved data to the project coordinator by email.

Data from all expert coders were checked for misspellings and mismatched elements (for example, a male patient could not be an obstetric case). Records of patient type with each diagnosis word and ICD code from every case were created using the n-triple (N3) format of the Resource Description Framework (RDF) [13] to build the experience base ontology. The ontology was built into the system using an inverted index structure by transforming it into the Lucene 3.4 [14] search engine library, which is the core engine of the semi-automated ICD coding system.

The new semi-automated coding system now has another ontology, the ICD experience base, created from the expert coders' work. The automated coding algorithm has one new step, which is executed when searching the ICD knowledge base yields no result. When an ICD code is not found after searching the ICD knowledge base, the system searches the ICD experience base. Since the ICD code of a diagnosis with the same patient context sometimes varies from one expert opinion to another, the system selects the ICD code with the highest frequency of expert opinion.

The competency of the semi-automated coding system was tested using another set of patient data. This dataset contains 14,982 diagnoses from 5,000 medical records of patients admitted during January to June 2011 to another hospital, which did not participate in the experience base creation. All ICD codes in this dataset were validated for 100% accuracy by another three expert coders. All ICD codes produced by the semi-automated coding system, when using the knowledge base only and when using the knowledge base with the experience base, were checked against the correct ICD codes in the dataset for accuracy.
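The fallback step can be summarized by a minimal sketch; this is our own illustration in Python rather than the Lucene-based implementation, and the lookup structures and field names are hypothetical.

    from collections import Counter

    def assign_icd_code(diagnosis, patient_context, knowledge_base, experience_base):
        """Two-stage lookup (sketch): try the ICD knowledge base first; if no code
        is found, fall back to the experience base and return the code most
        frequently chosen by expert coders for this diagnosis and patient context."""
        code = knowledge_base.get((diagnosis, patient_context))   # first-round search
        if code is not None:
            return code
        expert_codes = experience_base.get((diagnosis, patient_context), [])
        if not expert_codes:
            return None                                            # left to the human coder
        return Counter(expert_codes).most_common(1)[0][0]          # highest-frequency opinion

    # Hypothetical data mirroring the Dyslipidemia example in this paper.
    experience_base = {("dyslipidemia", "man_not_newborn"): ["E78.5", "E78.5", "E78.6"]}
    print(assign_icd_code("dyslipidemia", "man_not_newborn", {}, experience_base))  # E78.5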



III. RESULTS

By the end of the project, 4,880 diagnosis words and patient contexts had been collected from 32 expert coders. Ten expert coders did not send their cases by the deadline, so their data were excluded from analysis in this phase. All 4,880 diagnosis words and patient contexts were used to create the experience base ontology. A Python script was written and used to transform each record from comma-separated value file format to RDF N3 files.

The experience base ontology contains five concepts and four relations, as shown in Table I. Each diagnosis word in a patient record could be uniquely identified. Each ICD expert opinion on the ICD code that should be used for each diagnosis word, based on the patient context, was an important concept in the ontology. All these concepts and relations were used to construct all RDF statements in the experience base ontology. For example, if an expert abc123@mymail.com gave an opinion that the diagnosis word "disseminated tuberculosis" in the patient context "man, not newborn" should be coded to ICD code A18.3, the RDF statements in N3 format would be written as follows:
    dxword:disseminated_tuberculosis word:hasPtDxId ptdxid:001 .
    ptdxid:001 pt:isA ptcontext:man_not_newborn .
    ptdxid:001 icd:codeBy expert:abc123@mymail.com .
    ptdxid:001 icd:hasCode icd10:A183 .
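A conversion script of this kind could look like the following minimal sketch (our own illustration; the CSV column names and file name are hypothetical, not the project's actual layout).

    import csv

    def row_to_n3(row):
        """Turn one collected coding record into the four N3 triples above."""
        dx, ptdx, ctx, expert, code = (row["diagnosis"], row["ptdxid"],
                                       row["context"], row["expert"], row["icd10"])
        return "\n".join([
            f"dxword:{dx} word:hasPtDxId ptdxid:{ptdx} .",
            f"ptdxid:{ptdx} pt:isA ptcontext:{ctx} .",
            f"ptdxid:{ptdx} icd:codeBy expert:{expert} .",
            f"ptdxid:{ptdx} icd:hasCode icd10:{code} .",
        ])

    with open("experience_base.csv", newline="") as f:      # hypothetical input file
        for row in csv.DictReader(f):
            print(row_to_n3(row))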

Recall and precision of the system were calculated. The recall and precision when the system used the ICD knowledge base only were 0.477 and 0.909, while the recall and precision when the system used the ICD knowledge base with the experience base were 0.677 and 0.928.
Figure 1. A part of the ICD experience base. A diagnosis word "Dyslipidemia" in each patient record could be coded to various ICD codes, based on each expert opinion and each patient context.

The system was used to automatically code the 14,982 diagnoses in the test dataset. When the system used only the ICD knowledge base, it could find 7,142 ICD codes (47.67%), but when it used the ICD knowledge base with the experience base search, the system could find 9,283 ICD codes (61.96%). This increase in ability was tested for statistical significance using a paired T-test with an alpha value of 0.05: T-stat = -79.30 with p-value = 0.008 (< 0.05).

IV. DISCUSSION

The experience ontology concepts and relations can be presented as graph data, as in Figure 1.

TABLE I. ALL EXPERIENCE BASE CONCEPTS AND RELATIONS IN RDF N3 FORMAT

Concept/Relation   Ontology type   RDF format       Example
Diagnosis Word     Concept         dxword:          dxword:disseminated_tuberculosis
PatientDiag ID     Concept         ptdxid:          ptdxid:001
Patient Context    Concept         ptcontext:       ptcontext:man_not_newborn
Expert             Concept         expert:          expert:abc123@mymail.com
ICD10 Code         Concept         icd10:           icd10:A183
hasPtDxId          Relation        word:hasPtDxId   dxword:dyslipidemia word:hasPtDxId ptdxid:101
isA                Relation        pt:isA           ptdxid:101 pt:isA ptcontext:man_not_newborn
codeBy             Relation        icd:codeBy       ptdxid:101 icd:codeBy expert:abc
hasCode            Relation        icd:hasCode      ptdxid:101 icd:hasCode icd10:E78.5

ICD-10 coding is not a simple word matching process. Qualified human ICD coders never do a simple diagnosis word search or browse the diagnosis term in a list of ICD codes and labels. Unfortunately, research on semi-automated and automated ICD coding systems in the past [6-9] never recognized this important point, which explains why there is no real, workable, fully automated ICD coding system until now. The ICD index and tabular list of diseases were created in 1992; diagnosis words in ICD do not include every synonym, alternative name, or some specific diagnoses in highly specialized medical services. On the other hand, the ICD classification adds some patient context into the classification scheme, which means that coding one disease name may produce different ICD codes if the patient context changes. For example, the ICD code for the diagnosis "internal hemorrhoids" would be O22.4 when the patient is a pregnant woman, but the code would be I84.2 for an adult male patient. These facts make ICD coding a complex job that needs human coders. A clinical coder must know how to change some diagnosis words when first-round searching cannot find the code. She must have the patient record in hand all the time she is coding, to check the patient context that may affect correct ICD code selection.

Our semi-automated ICD coding system was not developed to replace all of the clinical coders' work on ICD coding. But if the system can find initial ICD codes for some of the diagnosis words summarized by the medical doctor, the coders' work will be reduced to some extent. Our system used an ICD ontology created from the ICD-10-TM alphabetical index and tabular list of



disease as knowledge bases to search for the correct ICD code for each diagnosis word plus patient context. Automated coding based on this knowledge could code 47.67% of all diagnoses with good accuracy (90.9%). The recall ability of the old system was low because, in real-world medical records, there are many varieties of words that doctors may use for a diagnosis. Some are new words which appeared after the creation of ICD-10; for example, "dyslipidemia", "chronic kidney disease", and "diabetes mellitus type 2" are more commonly used by doctors today than the older terms "hyperlipidemia", "chronic renal failure", and "non-insulin dependent diabetes mellitus" found in ICD-10.

Adding an experience base created from real-world cases to the system could increase the recall ability of the system. The ICD experience base ontology contains diagnosis words from real medical records with assigned ICD codes for these new words, so the system searches the experience base if first-round searching of the knowledge base yields no ICD code. The recall ability of the system increased from 0.477 to 0.677, with good precision (0.928).

Different expert opinions for the same diagnosis were anticipated in the experience base. In fact, a consensus of expert opinion was rarely found in the ICD coding experience base. The variety of expert opinions on the coding of some diagnosis words is shown in Table II. The system chooses the code with the highest frequency as the correct code. This strategy should be good unless there are too few opinions for some rare diagnosis words.
TABLE II. EXPERT OPINIONS ON SOME DIAGNOSIS WORDS IN THE ICD EXPERIENCE BASE ONTOLOGY

Diagnosis words             ICD codes from expert opinion    Highest frequency code
Dyslipidemia                E78.5, E78.6, E78.9              E78.5 (64.5%)
Chronic kidney disease      N18.0, N18.9, N19                N18.9 (35.5%)
Triple vessels disease      I21.4, I25.1, I25.9, N18.9       I25.1 (80%)
Diabetes mellitus type 2    E11.9, E11                       E11.9 (95.8%)

Although the ICD experience base ontology at this stage contains only 4,880 cases, this experiment encourages the use of an experience ontology to increase the recall ability of the semi-automated ICD coding system. In future research, we plan to add more cases to the experience base and to test the ability of the system with more test data.

V. CONCLUSION

An ICD experience base ontology could be created using ICD codes from medical records coded by expert coders. This experience base ontology was implemented in the semi-automated ICD coding system. Searching the experience base was very useful when first-round searching of the knowledge base yielded no result. The recall ability of the system could be increased by adding experience base searching to its algorithm, while good precision was still preserved.

ACKNOWLEDGMENT

This research was supported by the Thai National Health Security Office, the Thai Health Standard Coding Center (THCC), Ministry of Public Health, Thailand, and the Thai Collaborating Center for the WHO Family of International Classifications.

REFERENCES

[1] Bureau of Policy and Strategy, Ministry of Public Health, International Statistical Classification of Diseases and Related Health Problems, 10th Revision, Thai Modification (ICD-10-TM). Nonthaburi, Thailand: The Ministry of Public Health, 2009.
[2] The World Health Organization, International Statistical Classification of Diseases and Related Health Problems, 10th Revision. Geneva, Switzerland: The World Health Organization, 1992.
[3] The World Health Organization, International Statistical Classification of Diseases and Related Health Problems, 10th Revision, 2nd Edition. Geneva, Switzerland: The World Health Organization, 2004.
[4] The World Health Organization. ICD-10 online [internet]. Geneva, Switzerland: The World Health Organization; 2011 [cited 2011 Jun 30]. Available from: http://www.who.int/classifications/icd/en/.
[5] Bureau of Policy and Strategy, Ministry of Public Health, Thailand. International Statistical Classification of Diseases and Related Health Problems, 10th Revision, Thai Modification (ICD-10-TM). Nonthaburi, Thailand: The Ministry of Public Health, Thailand: 2000.
[6] C. Lovis, R. Baud, A. M. Rassinoux, P. A. Michel and J. R. Scherrer, "Building medical dictionaries for patient encoding systems: A methodology," in: Artificial Intelligence in Medicine. Heidelberg: Springer, 1997, pp. 373-380.
[7] G. Heja and G. Surjan, "Semi-automatic classification of clinical diagnoses with hybrid approach," in: Proceedings of the 15th Symposium on Computer Based Medical Systems (CBMS 2002). IEEE Computer Society Press, 2002, pp. 347-352.
[8] S. V. S. Pakhomov, J. D. Buntrock and C. G. Chute, "Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques," J Am Med Inform Assoc, 2006, 13, pp. 516-525.
[9] S. Periera, A. Neveol, P. Masari and M. Joubert, "Construction of a semi-automated ICD-10 coding help system to optimize medical and economic coding," in A. Hasman et al., editors, Ubiquity: Technologies for Better Health in Aging Societies, VA: IOS Press, 2006, pp. 845-850.
[10] The World Health Organization. International Statistical Classification of Diseases and Related Health Problems, 10th Revision, 2nd Edition, Volume 2. Geneva, Switzerland: The World Health Organization; 2004. p. 32.
[11] S. Nitsuwat and W. Paoin, "Development of ICD-10-TM ontology for a semi-automated morbidity coding system in Thailand," Methods of Information in Medicine, in press.
[12] Semi-automated ICD-10-TM coding system [internet]. Nonthaburi, Thailand: The Thai Health Coding Center, Ministry of Public Health, Thailand; [cited 2011 Aug 12]. Available from: http://www.thcc.or.th/formbasic/regis.php.
[13] RDF Notation 3 [internet]: The World Wide Web Consortium; [cited 2011 Jun 12]. Available from: http://www.w3.org/DesignIssues/Notation3.
[14] Apache Lucene [internet]: The Apache Software Foundation; [cited 2012 Jan 24]. Available from: http://lucene.apache.org/java/docs/index.html.



Collocation-Based Term Prediction for Academic Writing

Narisara Nakmaetee*, Maleerat Sodanil*, * Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand, na_naris@hotmail.com, msn@kmutnb.ac.th
Abstract - A research paper is a kind of academic writing, which is formal writing. Academic writing should not contain any mistakes; otherwise they would make the authors look unprofessional. In general, academic writing is a difficult task, especially for non-native speakers. Appropriate vocabulary selection and perfect grammar are two of the many important factors that make writing appear formal. In this paper, we propose and compare various collocation-based feature sets for training classification models to predict verbs and verb tense patterns for academic writing. The proposed feature sets include n-grams of both Part-of-Speech (POS) tags and collocated terms preceding and following the predicted term. From the experimental results, using the combination of Part-of-Speech (POS) tags and selected terms yielded the best accuracy of 50.21% for term prediction and 73.64% for verb tense prediction. Keywords: Academic writing; collocation; n-gram; Part-of-Speech (POS)

Choochart Haruechaiyasak Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center Pathumthani, Thailand choochart.haruechaiyasak@nectec.or.th

I. INTRODUCTION

In a broad definition, academic writing is any writing done to fulfill a requirement of a college or university [16]. There are several academic document types, such as book reports, essays, dissertations, and research papers. Academic writing is different from general writing because it is formal writing. Many factors contribute to the formality of a text; major influences include vocabulary selection, perfect grammar, and writing structure. For researchers, academic writing is an important channel for publishing their new knowledge, ideas, or arguments. Mistakes should not occur in academic writing, because they make the researcher look unprofessional. Moreover, errors in academic writing may result in the rejection of research papers. Thus, academic writing is a difficult task, especially for non-native speakers.

At present, there are many software packages that help researchers write research papers. The software can be classified into two groups: academic writing software and grammar checker software. Academic writing software provides academic writing style templates such as APA, MLA, and Chicago, together with page layout control and reference and citation features. Grammar checker software provides grammar checking, spelling, and grammar suggestion features based on a dictionary. Some packages provide general writing templates such as e-mail and business letter templates. From our review, we found that academic writing software cannot suggest suitable vocabulary for academic writing because it suggests vocabulary based on synonyms; synonymous words may be formal or informal, whereas academic writing should use only formal words.

In this paper, we focus on two factors that impact formal writing: vocabulary selection and perfect grammar. For vocabulary selection, there are two associated problems. The first problem is appropriate word selection. Non-native speakers often have difficulty in selecting appropriate vocabulary for academic writing because they tend to look up a word in a dictionary and use it without considering the word sense. They probably do not know the exact meaning of the word. Moreover, they often tend to use very basic vocabulary instead of a more sophisticated alternative. For example, consider the following two sentences:

(1) We talk about the main advantages of our methodology.
(2) We discuss the main advantages of our methodology.

Even though sentences (1) and (2) have the same meaning, they use different verbs, "talk about" and "discuss". Sentence (2) is more formal than sentence (1).

The second problem is collocation. Collocation errors are a common and persistent error type among non-native speakers. Due to collocation errors, a piece of writing may lack significant knowledge, which might cause a loss of precision. For example, consider the following two sentences:

(3) Numerous NLP applications rely search engine queries.
(4) Numerous NLP applications rely on search engine queries.

Sentence (3) contains a common error often made by a non-native speaker. As for perfect grammar, non-native speakers frequently make grammatical mistakes such as sentence fragments and wrong verb tense usage.

In this paper, we focus on two specific tasks: verb prediction and verb tense pattern prediction. Verb prediction is for suggesting a verb in a sentence which is suitable


for a given context. Verb tense pattern prediction is for suggesting the correct tense for a given verb. The remainder of this paper is organized as follows. In the next section, we review some related work on academic writing. Section III gives the details of our proposed approach. The experiments and discussion are given in Section IV. We conclude the paper and give directions for future work in Section V.

II. RELATED WORKS

There is some research related to academic writing, which can be classified into two groups: phrasal expression extraction [3][10] and word suggestion [4][13]. The phrasal expression extraction approach is based on statistical and rule-based algorithms for suggesting useful phrasal expressions. The word suggestion approach adopts a probabilistic model or machine learning to discover word associations and to build a model for word suggestion.

A collocation is a group of two or more words that usually go together. It is useful for helping English learners improve their fluency. Moreover, we can predict the meaning of the expression from the meanings of its parts [7]. Consequently, collocation information is useful for natural language processing. Collocations include noun phrases, phrasal verbs, and other stock phrases [7]; however, in our study we focus on phrasal verbs. There are many works related to collocation, which can be presented in four groups: lexical ambiguity resolution [1][8][14], machine translation [5][6][11], collocation extraction [9][12], and collocation suggestion [4][13][15]. For collocation extraction and suggestion, Church and Hanks [12] proposed techniques that used mutual information to measure the association between words. Pearce [9] described a collocation extraction technique using WordNet; the technique relied on a synonym mapping for each word sense. Futagi [2] discussed how dealing with non-pertinent English language learner errors in the development of an automated tool to detect miscollocations in learner texts significantly reduces possible tool errors; their work focused on the factors that affect the design of a collocation detection tool. Zaiu Inkpen and Hirst [15] presented an unsupervised method to acquire knowledge about the collocational behavior of near-synonyms; they used mutual information, Dice, chi-square, log-likelihood, and Fisher's exact test to measure the degree of association between two words. Li-E Liu, Wible, and Tsao [4] proposed a probabilistic collocation suggestion model which incorporated three features: word association strength, semantic similarity, and the notion of shared collocations. Wu, Chang, Mitamura, and Chang [13] introduced a machine learning model based on classification results to provide verb-noun collocation suggestions; they extracted collocations whose components have a syntactic relationship with another word. In this paper, we construct the

feature sets based on collocation: POS tags and collocated terms.

III. OUR PROPOSED APPROACH

In this section, we describe the details of different feature set approaches for verb and verb tense pattern prediction. Both approaches are based on collocation: Part-of-Speech (POS) tags and collocated terms. Fig. 1 illustrates the process for preparing the feature sets for training the prediction models. Firstly, we start by collecting a large number of research papers from the ACL Anthology website to develop our corpus. Secondly, we convert the papers in PDF format into text files and extract the abstracts from the text files. Next, we extract sentences from the documents. Then, we tokenize the input sentences into tokens. Furthermore, we tag the tokens in each sentence with their POS. Given a sentence from our corpus in (5), the process of POS tagging yields the result in sentence (6). The POS tag set is based on the Penn Treebank II guideline [18].

Figure 1. The process of feature set extraction



(5) More specifically this paper focuses on the robust extraction of Named Entities from speech input where a temporal mismatch between training and test corpora occurs.

(6) More/RBR specifically/RB this/DT paper/NN focuses/VBZ on/IN the/DT robust/JJ extraction/NN of/IN Named/VBN Entities/NNS from/IN speech/NN input/NN where/WRB a/DT temporal/JJ mismatch/NN between/IN training/NN and/CC test/NN corpora/NN occurs/NNS ./.

Next, we identify the verb and verb tense pattern in each sentence. Table I and Table II give some examples of verbs and verb tense patterns.
TABLE I. EXAMPLE SENTENCES FOR VERB PREDICTION
(example sentence -> verb tag)

- We present the technique of Virtual Annotation as a specialization of Predictive Annotation for answering definitional what is questions. -> present
- This paper proposes a practical approach employing n-gram models and error correction rules for Thai key prediction and Thai English language identification. -> propose
- This paper investigates the use of linguistic knowledge in passage retrieval as part of an open-domain question answering system. -> investigate
- In this paper, we demonstrate a discriminative approach to training simple word alignment models that are comparable in accuracy to the more complex generative models normally used. -> demonstrate
- We evaluate the results through measuring the overlap of our clusters with clusters compiled manually by experts. -> evaluate

TABLE II. EXAMPLE POS TAGGED SENTENCES FOR VERB TENSE PATTERN PREDICTION
(example POS tagged sentence -> verb tense pattern tag)

- This/DT demonstration/NN will/MD motivate/VB some/DT of/IN the/DT significant/JJ properties/NNS of/IN the/DT Galaxy/NNP Communicator/NNP Software/NNP Infrastructure/NNP and/CC show/VB how/WRB they/PRP support/VBP the/DT goals/NNS of/IN the/DT DARPA/NNP Communicator/NNP program/NN ./. -> /MD /VB
- First/RB we/PRP describe/VBP the/DT CU/NNP Communicator/NNP system/NN that/WDT integrates/VBZ speech/NN recognition/NNS synthesis/NN and/CC natural/JJ language/NN understanding/NN technologies/NNS using/VBG the/DT DARPA/NNP Hub/NNP Architecture/NNP ./. -> /VBP
- BravoBrava/NNP is/VBZ expanding/VBG the/DT repertoire/NN of/IN commercial/JJ user/NN interfaces/NNS by/IN incorporating/VBG multimodal/JJ techniques/NNS combining/VBG traditional/JJ point/NN and/CC click/NN interfaces/NNS with/IN speech/NN recognition/JJ speech/NN synthesis/NN and/CC gesture/NN recognition/NN ./. -> is /VBG
- We/PRP have/VBP aligned/VBN Japanese/JJ and/CC English/JJ news/NN articles/NNS and/CC entences/NNS to/TO make/VB a/DT large/JJ parallel/NN corpus/NN ./. -> have /VBN
- Recently/RB confusion/NN network/NN decoding/NN has/VBZ been/VBN applied/VBN in/IN machine/NN translation/NN system/NN combination/NN ./. -> has been /VBN
TABLE III. EXAMPLE OF FEATURE LABEL ASSIGNMENT FOR A POS TAGGED SENTENCE
(example sentence: "In contrast to previous work we particularly focus exclusively on clustering polysemic verbs")

Token (POS)          N-gram term and POS features        Selected term features
In (/IN)             -                                   -
contrast (/NN)       -                                   -
to (/TO)             -                                   -
previous (/JJ)       -                                   -
work (/NN)           pre3-gram, prePOS3-gram             -
we (/PRP)            pre2-gram, prePOS2-gram             preNoun/prePronoun
particularly (/RB)   pre1-gram, prePOS1-gram             preAdv
focus (/VBP)         (predicted verb)                    -
exclusively (/RB)    post1-gram, postPOS1-gram           postAdv
on (/IN)             post2-gram, postPOS2-gram           postPrepo
clustering (/VBG)    post3-gram, postPOS3-gram           -
polysemic (/JJ)      -                                   -
verbs (/NNS)         -                                   -



TABLE IV. FEATURE SETS FOR VERB AND VERB TENSE PATTERN PREDICTION

Verb feature sets:
- Term-only: {1-gram}, {2-gram}, {3-gram}
- Term&POS: {1-gram}, {2-gram}, {3-gram}
- Selected Term-only: {3-gram, preNoun/prePronoun, postNoun}; {3-gram, preNoun/prePronoun, postNoun, preAdv, postAdv}; {3-gram, preNoun/prePronoun, postNoun, postPrepo}; {3-gram, preNoun/prePronoun, postNoun, preAdv, postAdv, postPrepo}
- Selected Term&POS: the same four combinations, with the corresponding 3-gram POS tags included

Verb tense pattern feature sets:
- Term-only: {1-gram}, {2-gram}, {3-gram}
- POS-only: {1-gram}, {2-gram}, {3-gram}
- Term&POS: {1-gram}, {2-gram}, {3-gram}
- Selected Term&POS: {1-gram, preNoun/prePronoun, postNoun}; {2-gram, preNoun/prePronoun, postNoun}; {3-gram, preNoun/prePronoun, postNoun}

Then, we observe the POS tagged sentence and assign the feature labels as shown in Table III. From the observation of POS tagged sentences, we find that a noun usually occurs in the positions preceding and following a verb. Therefore, based on linguistic knowledge and observation, we select the noun as part of our feature set. In Table III, pre1-gram, pre2-gram, and pre3-gram and prePOS1-gram, prePOS2-gram, and prePOS3-gram denote the terms and POS tags in the first, second, and third positions preceding the verb, while post1-gram, post2-gram, and post3-gram and postPOS1-gram, postPOS2-gram, and postPOS3-gram denote the terms and POS tags in the first, second, and third positions following the verb. Feature preNoun/prePronoun denotes a noun or pronoun that occurs before the verb, preAdv denotes an adverb that occurs before the verb, and postAdv and postPrepo denote an adverb and a preposition that occur after the verb.

The final feature labels based on the selected terms and POS tags are shown in Table IV. Term-only denotes the feature sets that include only the terms occurring in the positions preceding and following a verb; there are three such sets (1-gram, 2-gram, and 3-gram), where an n-gram set uses the terms in the first n preceding and following positions of the verb. POS-only denotes the corresponding feature sets that use only the POS tags in those positions, and Term&POS denotes the feature sets that combine both the terms and the POS tags, again with 1-gram, 2-gram, and 3-gram variants. Selected Term-only denotes the feature sets that combine the 3-gram terms with the selected terms, i.e., the preceding noun or pronoun and the following noun, optionally extended with the preceding and following adverbs and/or the following preposition; this gives the four combinations listed in Table IV. Finally, in Table IV there are two groups of Selected Term&POS


feature sets: the Selected Term&POS verb feature sets and the Selected Term&POS verb tense pattern feature sets. For the Selected Term&POS verb feature sets, there are four patterns, which combine the 3-gram terms and 3-gram POS tags with the preceding noun or pronoun and the following noun, optionally extended with the preceding and following adverbs and/or the following preposition. For the Selected Term&POS verb tense pattern feature sets, there are three patterns, which combine the 1-gram, 2-gram, or 3-gram terms and POS tags with the noun or pronoun preceding the verb and the noun following the verb.

IV. EXPERIMENTS AND DISCUSSION

From the research paper archive, the ACL Anthology [17], we collected 3,637 abstracts from ACL and HLT conferences from 2000 to 2011, from which we extracted 15,151 sentences. To evaluate the performance of all feature set approaches, we use the Naive Bayes classification algorithm with 10-fold cross validation on the data set.

For verb prediction, we selected the top 10 ranked verbs found in the corpus. The top-10 verbs are be, describe, present, demonstrate, propose, achieve, use, evaluate, investigate, and compare. From our corpus, we selected the 3,149 sentences which contain these top-10 verbs for evaluating the verb prediction feature sets. There are 14 feature set approaches, which can be classified into four groups: Term-only, Term&POS, Selected Term-only, and Selected Term&POS. Table V presents the performance evaluation of the verb prediction feature sets based on accuracy. From the table, it can be observed that the performance improves when the n-gram size increases. Using only POS does not increase the performance of verb prediction because a POS tag is a linguistic category of a word in the sentence. However, we found that the performance of the POS and selected noun feature set is better than the selected noun feature set without POS. Moreover, using the adverb does not help increase the performance. On the other hand, the preposition helps improve the performance; the reason is that some prepositions usually collocate with a verb, such as "rely on". In summary, the best feature set uses the 3-gram terms and POS with the selected noun, pronoun, and preposition terms. The highest accuracy is approximately 50%.
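The evaluation setup can be sketched as follows; this is a toy illustration assuming scikit-learn, not the authors' pipeline, and the two duplicated instances stand in for the thousands of extracted sentences only so that 10-fold cross-validation runs.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # One feature dictionary per sentence (collocation features around the removed verb).
    X = [
        {"pre1-gram": "we", "prePOS1-gram": "PRP", "post1-gram": "the", "postPOS1-gram": "DT"},
        {"pre1-gram": "paper", "prePOS1-gram": "NN", "post1-gram": "a", "postPOS1-gram": "DT",
         "postPrepo": "on"},
    ] * 10
    y = ["present", "propose"] * 10        # target verbs (two of the top-10 verbs)

    model = make_pipeline(DictVectorizer(), MultinomialNB())
    print(cross_val_score(model, X, y, cv=10, scoring="accuracy").mean())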
TABLE V. EVALUATION RESULTS FOR FEATURE SETS OF VERB PREDICTION

Approach             Feature set                                                     Accuracy (%)
Term-only            1-gram                                                          42.89
Term-only            2-gram                                                          48.10
Term-only            3-gram                                                          49.03
Term&POS             1-gram                                                          42.27
Term&POS             2-gram                                                          47.31
Term&POS             3-gram                                                          48.14
Selected Term-only   3-gram, preNoun/pronoun, postNoun                               49.16
Selected Term-only   3-gram, preNoun/pronoun, postNoun, preAdv, postAdv              48.99
Selected Term-only   3-gram, preNoun/pronoun, postNoun, postPrepo                    50.05
Selected Term-only   3-gram, preNoun/pronoun, postNoun, preAdv, postAdv, postPrepo   49.32
Selected Term&POS    3-gram, preNoun/pronoun, postNoun                               49.29
Selected Term&POS    3-gram, preNoun/pronoun, postNoun, preAdv, postAdv              49.16
Selected Term&POS    3-gram, preNoun/pronoun, postNoun, postPrepo                    50.21
Selected Term&POS    3-gram, preNoun/pronoun, postNoun, preAdv, postAdv, postPrepo   49.95

TABLE VI. EVALUATION RESULTS FOR FEATURE SETS OF VERB TENSE PATTERN PREDICTION

Approach             Feature set                                       Accuracy (%)
Term-only            1-gram                                            68.71
Term-only            2-gram                                            70.27
Term-only            3-gram                                            69.99
POS-only             1-gram                                            67.52
POS-only             2-gram                                            65.58
POS-only             3-gram                                            65.14
Term&POS             1-gram                                            72.72
Term&POS             2-gram                                            70.96
Term&POS             3-gram                                            70.49
Selected Term&POS    1-gram Term&POS, preNoun/pronoun, postNoun        73.64
Selected Term&POS    2-gram Term&POS, preNoun/pronoun, postNoun        72.44
Selected Term&POS    3-gram Term&POS, preNoun/pronoun, postNoun        71.92

For verb tense pattern prediction, we used the full corpus of 15,151 sentences. Similar to verb prediction, the verb tense pattern feature sets can be classified into four groups: Term-only, POS-only, Term&POS, and Selected Term&POS. Table VI presents the performance evaluation based on accuracy. It can be observed that the performance of POS-only is quite low, and that combining selected terms with POS increases it. The performance of POS combined with the selected terms
of the noun feature set is better than the other feature sets. The best feature set is the 1-gram Term&POS with the selected nouns and pronouns. The reason is that nouns and pronouns provide a very good clue for predicting the verb tense, since they act as subjects in the sentence.
V. CONCLUSION AND FUTURE WORKS
We performed a comparative study on various feature sets for predicting the verb and the verb tense pattern in sentences. Four groups of feature sets based on Part-of-Speech (POS) tags and selected terms, such as nouns and pronouns, were evaluated in the experiments. We performed the experiments using the abstract corpus as the data set and Naive Bayes as the classification algorithm. From the experimental results, verb prediction using the 3-gram terms and POSs with selected nouns, pronouns, and prepositions yielded the best result of 50.21% accuracy. For verb tense prediction, the highest accuracy of 73.64% was obtained using 1-gram terms and POS with selected nouns and pronouns. For future work, we will improve the performance of verb prediction by using WordNet, a large lexical database, to find synonyms of a word with the appropriate word sense. Moreover, instead of a multi-class classification model, we will adopt a one-against-all classification model to improve the verb prediction results.
REFERENCES
[1] D. Biber, "Co-occurrence patterns among collocations: a tool for corpus-based lexical knowledge acquisition," Comput. Linguist. 19, pp. 531-538, 1993.
[2] Y. Futagi, "The effects of learner errors on the development of a collocation detection tool," Proc. of the fourth workshop on Analytics for noisy unstructured text data, pp. 27-33, 2010.
[3] S. Kozawa, Y. Sakai, et al., "Automatic Extraction of Phrasal Expression for Supporting English Academic Writing," Proc. of the 2nd KES International Symposium IDT 2010, pp. 485-493, 2010.
[4] A. Li-E Liu, D. Wible, and N. Tsao, "Automated Suggestions for Miscollocations," Proc. of the NAACL HLT Workshop on Innovative Use of NLP for Building Educational Applications, pp. 47-50, 2009.
[5] Z. Liu et al., "Improving Statistical Machine Translation with monolingual collocation," Proc. of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 825-833, 2010.
[6] Y. Lu and M. Zhou, "Collocation translation acquisition using monolingual corpora," Proc. of the 42nd Annual Meeting on Association for Computational Linguistics, 2004.
[7] C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, 1999.
[8] D. Martinez and E. Agirre, "One sense per collocation and genre/topic variations," Proc. of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora, pp. 207-215, 2000.
[9] D. Pearce, "Synonymy in Collocation Extraction," Proc. of the NAACL 2001 Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, 2001.
[10] Y. Sakai, K. Sugiki, et al., "Acquisition of useful expressions from English research papers," Natural Language Processing, SNLP '09, pp. 59-62, 2009.
[11] F. Smadja et al., "Translating collocations for bilingual lexicons: a statistical approach," Comput. Linguist., vol. 22, pp. 1-38, 1996.
[12] K. Ward Church and P. Hanks, "Word Association Norms, Mutual Information, and Lexicography," Proc. of the 27th Annual Meeting of the Association for Computational Linguistics, pp. 76-83, 1989.
[13] J. Wu et al., "Automatic Collocation Suggestion in Academic Writing," Proc. of the ACL 2010 Conference Short Papers, pp. 116-119, 2010.
[14] D. Yarowsky, "One sense per collocation," Proc. of the workshop on Human Language Technology, pp. 266-271, 1993.
[15] D. Zaiu Inkpen and G. Hirst, "Acquiring collocations for lexical choice between near-synonyms," Proc. of the ACL-02 workshop on Unsupervised lexical acquisition, pp. 67-76, 2002.
[16] Academic writing definition, available at: http://reference.yourdictionary.com/worddefinitions/definition-of-academic-writing.html
[17] ACL Anthology, available at: http://aclweb.org/anthology-new/
[18] Penn Treebank II Tags, available at: http://bulba.sdsu.edu/jeanette/thesis/PennTags.html

Thai Poetry in Machine Translation


An Analysis of Poetry Translation using Statistical Machine Translation
Sajjaporn Waijanya
Faculty of Information Technology King Mongkut's University of Technology North Bangkok Bangkok, Thailand Sajjaporn.w@gmail.com

Anirach Mingkhwan
Faculty of Industrial and Technology Management King Mongkut's University of Technology North Bangkok Prachinburi, Thailand anirach@ieee.org

Abstract- The translation of poetry from its original language into another is very different from general machine translation, because a poem is written with prosody. Thai poetry is composed of sets of syllables, and its rhymes, running across stanzas, lines, and the text of the poem, may not form complete syntax. This research focuses on the Google and Bing machine translators and on tuning the prosody of their output with respect to syllables and rhyme. We compared the error rate (in percent) of the standard translators with that of the translators after tuning: before tuning, the error rate of both translators was 97% per rhyme; after tuning, it decreased to 60% per rhyme. To evaluate the meaning of the results of both kinds of translators, we use the BLEU (Bilingual Evaluation Understudy) metric to compare reference and candidate. The BLEU score of Google is 0.287 and of Bing is 0.215. We can conclude that current machine translators cannot provide good results for translating Thai poetry. This research should be the starting point for a new kind of machine translator for Thai poetry and, furthermore, a way to bring this Thai language art to the global public.
Keywords- Thai poetry translation; translation evaluation; poem machine translator

I. INTRODUCTION

Poetry is one of the fine arts in each country. The French poet Paul Valéry defined poetry as "a language within a language" [1]. Poetry can tell a story, communicate by sound and sight, and simply express feelings. Translating poetry from its original language into other languages is a way to propagate a culture to other countries in the world. Machine translation of poetry is a challenge for researchers and developers [2]. According to Robert Frost's definition, "Poetry is what gets lost in translation." This statement reflects how difficult it is to translate poetry from the original language into other languages while keeping the original prosody, because each type of poetry has its own specific syntax (prosody): they differ in line length (number of syllables), rhyme, meter, and pattern. Many researchers have tried to develop poetry machine translators for Chinese, Italian, Japanese (haiku), and Spanish poetry into English, or back from English into the original language, such as the poetry of William Shakespeare. They have developed poetry machine translation based on statistical machine translation techniques.
As for Thai poetry and Thai poets, Phra Sunthorn Vohara, known as Sunthorn Phu (26 June 1786-1855), is Thailand's best-known royal poet [3]. In 1986, the 200th anniversary of his birth, Sunthorn Phu was honored by UNESCO as a great world poet. His Phra Aphai Mani poems describe a fantastical world where people of all races and religions live and interact together in harmony. In the machine translation area, however, we have found no research on Thai poetry machine translation. Thai poetry has five major types: Klong, Chann, Khapp, Klonn and Raai. In this paper we use the Thai prosody "Klon-Pad (Klon Suphap)" for translation into English. Klon-Pad has rules for the syllable, the line (Wak), the Baat, the Bot, and the relation between syllables in each Wak [4]. These relations contribute to the beauty of the content of the creative writing and differ across prosodies. Thai poetry has a complex structure of rhyme and syllables. Each line (Wak) of a Thai poem is not a complete sentence (Subject-Object-Verb). Furthermore, some Thai words can have several meanings when translated into English. These are the reasons why it is difficult to develop a Thai poetry machine translator. Our study translates two Bot of Klon-8 Thai poetry with two statistical machine translators, Google Translator [5] and Bing Translator [6]. We then tune the prosody using a dictionary and compare the resulting English poetry against the Thai prosody in Section 3. We use a case study from "Sakura, Taj Mahal" [7] by Professor Srisurang Poolthupya as a reference for the evaluation with the BLEU (Bilingual Evaluation Understudy) metric in Section 4. Section 5 concludes this paper and points out possible further work in this direction.
II. RELATED WORKS

Although we cannot find any research related to machine translation of Thai poetry into English, there are several research papers on machine translation of poetry from Chinese, Italian, and French into English.
A. A Study of Computer Aided Poem Translation Appreciation [8]
This paper collects three English versions of "Yellow Crane Tower", a poem of the Tang dynasty, applies the available computational linguistic techniques for a quantitative
analysis, and uses BLEU metrics for automatic machine translation evaluation. Its conclusion is that the currently available computational linguistic technology is not capable of semantic analysis, which is, without a doubt, a severe drawback for poetry translation evaluation.
B. Poetic Statistical Machine Translation: Rhyme and Meter [9]
This is a paper from the Google MT (Machine Translation) lab. They use the Google translator and implement the ability to produce translations with meter and rhyme for phrase-based MT. They train a baseline phrase-based French-English system using the WMT-09 corpora for training and evaluation, and use a proprietary pronunciation module to provide a phonetic representation of English words. The evaluation uses the BLEU score; the baseline BLEU score of this research is 10.27. This baseline score is quite low, and the system also has a performance problem: it is still slow.
C. Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation [10]
This paper applies unsupervised learning to reveal word-stress patterns in a corpus of raw poetry and uses these word-stress patterns, in addition to rhyme and discourse models, to generate English love poetry. Finally, they translate Italian poetry into English, choosing target realizations that conform to desired rhythmic patterns. For poetry generation an FST (finite state transducer) is used; however, this technology has various problems when the results have to be evaluated by humans. For poetry translation they use PBTM (phrase-based translation with meter). The advantage of poetry translation over generation is that the source text provides a coherent sequence of propositions and images, allowing the machine to focus on how to say instead of what to say.
III. OUR PROPOSED APPROACH
A. Methodology
1) Machine Translation: MT (machine translation) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another. MT has two major types: rule-based machine translation and statistical machine translation.
a) Rule-based machine translation: relies on countless built-in linguistic rules and millions of bilingual dictionary entries for each language pair. Rule-based machine translation includes the transfer-based, interlingual, and dictionary-based machine translation paradigms. A typical English sentence consists of two major parts: a noun phrase (NP) and a verb phrase (VP).
b) Statistical machine translation: is based on bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation. The translators from both Google and Bing are statistical machine translators; our team uses the Google and Bing translator APIs to translate Thai poetry.
2) English Syllable Rules and Phonetics: Syllables are very important in the prosody of Thai poetry. Each Wak has a rule for the number of syllables, and the relation between Wak and Bot is checked on the sound of the syllable. Every syllable consists of a nucleus and an optional coda; this is the part of the syllable used in poetic rhyme, and the part that is lengthened or stressed when a person elongates or stresses a word in speech. The simplest model of syllable structure [11] divides each syllable into an optional onset, an obligatory nucleus, and an optional coda. Figure 1 shows the structure of a syllable.

Figure 1. Structure of a syllable

Normally we can check the relation between rhymes by checking the relation of the sounds in the syllables; this is phonetics, and it tells us how a word is pronounced.
B. An Algorithm and Case Study
1) System Flowchart: To study Thai poetry in machine translation, we use Thai poetry Klon-Pad of two Bot (8 lines) as the input to this process. Figure 2 shows the system flowchart of this process.
[Figure 2 (flowchart): the Thai poetry passes through 1. Language Translator, 2. Poetry Checking, and 3. Poetry Prosody Tuning, producing poetry in English and poetry in English with tuning.]
Figure 2. System flowchart of Thai poetry in machine translation

In Figure 2, we design three modules to translate Thai poetry into English.
a) Language Translator: we use the Google and Bing API machine translators to translate the Thai poetry into English.
b) Poetry Checking: used to check the prosody of the poetry after translation into English. The result of this module is the Thai poetry in English together with the error points of the poetry.
c) Poetry Prosody Tuning: after module 2 (Poetry Checking), we collect the error points and tune the poetry using a dictionary and a thesaurus. The expected result of this module is a decrease in the percentage of errors.
Case study: we process twenty Klon-Pad Thai poems through the three modules without a professional English reference translation, and we process one Thai poem from "Sakura, Taj Mahal" by Professor Srisurang Poolthupya, using its published translation as the reference and the Google and Bing API outputs as candidates to calculate the BLEU score. The three modules are described in sub-sections 2), 3) and 4) and in Figures 3 and 4.
2) Language Translator Module: This module takes the input Thai poetry (Klon-Pad) in Thai and translates it into English with the Google and Bing API machine translators. Figure 3 shows the process of this module.
Figure 3. Language Translator Module
a) Case Study 1, original Thai poetry: the Thai poem "Deuan-chaai" from the book "Oh ja o dk mai oie".
b) Case Study 1, translated by the Google API:
A herbaceous plant species growing easy.
A good performance in film I Tawil.
Some red, some yellow Ospin technician.
Eidin flowers and clean the whole day.
Month Movie I will be racing in the light shines.
Just to name a synonym is real nice there.
The fertilizer plant in projecting the profile.
Keeps up the quality scale flattery.
c) Case Study 1, translated by the Bing API:
As the cultivation of plant species is very easy.
Strong desire to make beautiful films a month name.
Yellow and red are really sophin?
Flores choetchin prominent pane all day.
Last month, the race featured a Moonlight illuminates.
Only the name Allied euphonic to life there.
Even more sparkling variety fertilizer month projection
We are too badly, the excessive praise.

3) Poetry Checking Module: This module processes the Thai poetry in English obtained from the Google and Bing APIs. We analyze the syntax and collect the error points with respect to the prosody of Klon-Pad Thai poetry for the twenty poems. Figure 4 shows the process of the Poetry Checking module.
[Figure 4 (flowchart): the Thai poetry in English is checked for line length (number of syllables), rhyme (phonetic relations), and words out of vocabulary; the errors of each type are collected, and the module outputs the Thai poetry in English with error marks.]
Figure 4. Poetry Checking Module

a) Check Line Length (Number of Syllables): Thai poetry has a prosody rule for the number of syllables in a line. Each line is allowed 7 to 9 syllables; if a line has more than 9 or fewer than 7 syllables, it is marked as a line-length error.
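The line-length check above and the rhyme-relation check described in the following subsections can be approximated with a pronunciation dictionary. The sketch below is an assumption-laden illustration (it uses NLTK's CMU Pronouncing Dictionary, which the paper does not name): syllables are counted as vowel phonemes, and two words are treated as rhyme-related when they share a final vowel-plus-coda but are not the same word.

```python
# Sketch of the line-length (7-9 syllables) and rhyme-relation checks.
# Uses NLTK's CMU Pronouncing Dictionary as a stand-in phonetic resource.
import nltk
from nltk.corpus import cmudict

PRON = cmudict.dict()   # word -> list of phoneme sequences

def syllable_count(word):
    """Number of vowel phonemes in the first pronunciation (0 if unknown)."""
    prons = PRON.get(word.lower())
    if not prons:
        return 0
    return sum(1 for ph in prons[0] if ph[-1].isdigit())

def line_length_ok(line, low=7, high=9):
    total = sum(syllable_count(w) for w in nltk.word_tokenize(line) if w.isalpha())
    return low <= total <= high, total

def rhyme_part(word):
    """Final vowel-to-end segment of the pronunciation, stress marks removed."""
    prons = PRON.get(word.lower())
    if not prons:
        return None
    phones = prons[0]
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1].isdigit():            # last vowel phoneme
            return tuple(p.rstrip("012") for p in phones[i:])
    return None

def rhyme_related(w1, w2):
    """Similar pronunciation but not a duplicate word, as required by the rules."""
    if w1.lower() == w2.lower():
        return False                            # duplicates count as errors
    r1, r2 = rhyme_part(w1), rhyme_part(w2)
    return r1 is not None and r1 == r2

# Example: rhyme_related("today", "may") -> True; rhyme_related("today", "tonight") -> False
```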

For Case Study 1, translated by the Google API, we found 7 lines with errors, as shown in Table I below.
TABLE I. AN EXAMPLE: THAI POETRY DEUAN-CHAAI TRANSLATED BY THE GOOGLE API

Google Version                                              Syllable Count
A herbaceous plant species growing easy.                    11
A good performance in film I Tawil.                         10
Some red, some yellow Ospin technician.                     10
Eidin flowers and clean the whole day.                      9 (a)
Month Movie I will be racing in the light shines.           12
Just to name a synonym is real nice there.                  11
The fertilizer plant in projecting the profile.             13
Keeps up the quality scale flattery.                        10

a. 9 syllables is not an error against the prosody rule for the number of syllables in a line.

TABLE II. AN EXAMPLE: THAI POETRY DEUAN-CHAAI TRANSLATED BY THE BING API

Bing Version                                                Syllable Count
As the cultivation of plant species is very easy.           15
Strong desire to make beautiful films a month name.         12
Yellow and red are really sophin?                           10
Flores choetchin prominent pane all day.                    10
Last month, the race featured a Moonlight illuminates.      13
Only the name Allied euphonic to life there.                13
Even more sparkling variety fertilizer month projection     17
We are too badly, the excessive praise.                     10

Tables I and II give the number of syllables in each Wak. With the Google API, this poem has only one Wak in which the number of syllables is correct. With the Bing API, not a single Wak has the correct number of syllables; all are error tagged.
b) Check Rhyme (Relation of Phonetics): Thai poetry has rules for its rhyme. For Klon-Pad we present the rhyme rules in Figure 5.

Figure 5. Rhyme Prosody for Thai Poetry Klon-Pad (2 Bot)

Figure 5 shows Thai poetry Klon-Pad of two Bot with 14 rhyme rules, as follows:
R1: relation of a1 and a2, or a1 and ax
R2: relation of b1 and b2
R3: relation of b1 and b3, or b1 and bx
R4: relation of b2 and b3, or b2 and bx
R5: relation of b1, b2 and b3, or b1, b2 and bx
R6: relation of c1 and c2, or c1 and cx
R7: relation of d1 and d2
R8: relation of d1 and d3
R9: relation of d1 and d4, or d1 and dx
R10: relation of d2 and d3
R11: relation of d2 and d4, or d2 and dx
R12: relation of d2, d3 and d4, or d2, d3 and dx
R13: relation of d3 and d4, or d3 and dx
R14: relation of d1, d2, d3 and d4, or d1, d2, d3 and dx
In this process, we check the relation of the syllables according to these rules. A relation in Thai poetry means a similar pronunciation, but not a duplicate.
Example 1: "today" relates to "may"; this is correct under the rhyme rules.
Example 2: "today" relates to "Monday"; this is an error (duplicate) under the rhyme rules.
Example 3: "today" relates to "tonight"; this is an error (not related) under the rhyme rules.
For Case Study 1, translated by the Google API, we found errors in 13 rules; only rule R3 was correct. For Case Study 1, translated by the Bing API, we found errors in 12 rules; rules R1 and R3 were correct.
c) Check Words out of Vocabulary: We used a dictionary and a thesaurus to check the meaning of these words. We found that the MT tried to translate such words by writing them out as phonemes. These words may well have a meaning in Thai, but it is too complex to translate them from Thai to English in only one step; many words should first be translated from Thai to Thai before being sent to the MT. The words that the MT was not able to translate are referred to in this paper as "words out of vocabulary", and they are error tagged.
For Case Study 1, translated by the Google API, we found 3 words out of vocabulary: "Tawil", "Ospin" and "Eidin". "Tawil" means to miss or think of someone; "Ospin" and "Eidin" both mean beautiful. For Case Study 1, translated by the Bing API, we found 2 words out of vocabulary: "sophin" and "choetchin"; both mean beautiful.
4) Poetry Prosody Tuning: To study basic tuning of the poetry translated by MT, we apply the following approaches to the twenty poems:
a) Words out of vocabulary: translate Thai to Thai before translating with the MT.

b) Number of syllables error: the majority of the errors are lines having more syllables than allowed. We therefore used a dictionary and thesaurus to shorten the sentences with the help of shorter words; omitting articles such as "a", "an" and "the" was an additional way to decrease the length.
c) Rhyme error: we tune this error by using a dictionary and thesaurus to change the words in the rhyme positions.
C. Measurement Design
In this paper we use two major kinds of measurement.
1) Error percentage: We process the twenty Thai poems and calculate their prosody error percentages with the equations below.

Es = (Ps / Ts) × 100   (1)

Equation (1): Es is the syllable error percentage of a Bot, Ps is the number of syllable errors, and Ts is the total number of Wak (8 Waks) in a Bot. We calculate the error percentage of the rhyme by equation (2):

Er = (Pr / Tr) × 100   (2)

Equation (2): Er is the rhyme error percentage of a Bot, Pr is the number of rhyme errors, and Tr is the total number of rhymes (14 rhyme positions) in a Bot. We calculate the error percentage related to wrongly used words with the help of a vocabulary, see equation (3):

Ew = (Pw / Tw) × 100   (3)

Equation (3): Ew is the percentage of vocabulary errors per Bot, Pw is the number of wrong words, and Tw is the total number of words per Bot (at most 72 words). Finally, we calculate the average percentage of each error type over all twenty poems; in this way we can create a summary to evaluate the results.
2) BLEU Score: BLEU (Bilingual Evaluation Understudy) [12] is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human. BLEU uses a modified form of precision to compare a candidate translation against multiple reference translations. The metric modifies simple precision because machine translation systems are known to generate more words than appear in a reference text. The BLEU equation is shown in (4):

BLEU = BP × exp( Σ_{n=1}^{N} wn log pn )   (4)

where pn is the modified n-gram precision (the geometric mean of p1, p2, ..., pN is taken) and BP is the brevity penalty, with c the length of the MT hypothesis (candidate) and r the length of the reference:

BP = 1 if c > r;  BP = e^(1 - r/c) if c <= r   (5)

In our baseline, we use N = 4 and uniform weights wn = 1/N.
IV. EXPERIMENT RESULTS
In our experiments we translated twenty poems with two machine translators, the Google API and the Bing API; both are statistical machine translators. In Case Study 2 we use poetry from "Sakura, Taj Mahal" by Professor Srisurang Poolthupya as the reference and the Google and Bing translations of the Thai poetry from this book as candidates; this case study is evaluated with the BLEU score. Finally, we summarize the results in the following part.
A. Results of Thai Poetry in the Google and Bing Translators
Table III shows the percentage of errors of the three error types before tuning. Most of these errors are rhyme mistakes, because the MT is not able to handle rhyme and meter. The "with Tuning" columns show the percentage of errors after tuning for all three parts.

TABLE III. PERCENT OF LINE-LENGTH ERROR, RHYME ERROR AND WORDS OUT OF VOCABULARY BEFORE TUNING AND AFTER TUNING

Items                                 Google   Google with Tuning   Bing   Bing with Tuning
Total lines                           160      160                  160    160
Line-length (syllable) errors         50       28                   87     33
Percent of syllable error             31%      18%                  54%    21%
Total rhymes                          280      280                  280    280
Rhyme errors                          271      158                  272    147
Percent of rhyme error                97%      56%                  97%    62%
Total words                           1440     1440                 1440   1440
Words out of vocabulary               50       15                   87     22
Percent of words out of vocabulary    2%       1%                   3%     2%

B. Case Study 2: Poetry from "Sakura, Taj Mahal" and BLEU Evaluation
The original poetry in Thai and English is shown in Table IV.

TABLE IV. THAI POETRY FROM THE BOOK "SAKURA, TAJ MAHAL"

Original Thai Poetry          Reference: translated by the owner of the poetry
                              Sunthon Phu, the great Thai poet,
                              I pay my respect to you, my guru.
                              May you grant me the flow of rhyme,
                              Both in Thai and in English,
                              That I may express my thoughts,
                              In a fluent and precise way,
                              Pleasing the audience and critics,
                              Inspiring peace and well-being

We use the original English poetry as the reference and compare it with the output of both the Google and Bing translators. The calculated BLEU scores are shown in Table V.
TABLE V. BLEU SCORES OF THE CANDIDATES FROM THE GOOGLE AND BING TRANSLATORS

Candidate line                                                  BLEU
Google:
I bow my head respectfully Soonthornphu teachers.              0.840
Please get to know us, this makes me respect.                  0.905
Please help with any poem.                                     0.000
Thai English proficiently to process                           0.549
Various meanings can be very difficult.                        0.000
Relevant comparative information.                              0.000
Reading comprehension and critical mass.                       0.000
The prospectus provides a relaxed feel.                        0.000
Average BLEU score (Google)                                    0.287
Bing:
We also ketkrap the teacher verse harmonious Mussel            0.840
Please recognize this request for a given by the audience.     0.000
What a blessing you, help facilitate poem                      0.309
Fluent in English, Thai, tongkrabuan                           0.574
Describe the various not complicated                           0.000
Completely irrelevant comparisons.                             0.000
Catching someone reviews read all                              0.000
To help you relax, prospectus                                  0.000
Average BLEU score (Bing)                                      0.215
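As a concrete illustration of the BLEU evaluation of equations (4) and (5), the sketch below scores a candidate line against the single reference with NLTK's implementation (N = 4, uniform weights; smoothing is added because many short lines share no higher-order n-grams, which is why so many entries above are 0.000). This is a hedged example, not the exact setup used to produce Table V.

```python
# Sentence-level BLEU with N = 4 and uniform weights, one reference per line.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def line_bleu(reference, candidate):
    ref_tokens = [reference.lower().split()]      # list of reference token lists
    cand_tokens = candidate.lower().split()
    return sentence_bleu(ref_tokens, cand_tokens,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

reference = "Sunthon Phu, the great Thai poet,"
candidate = "I bow my head respectfully Soonthornphu teachers."
print(round(line_bleu(reference, candidate), 3))
```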

V. CONCLUSION AND FUTURE WORK
The generated results show that these machine translators have many problems in translating poetry. MT translates poetry without prosody: it is not able to understand the poetry pattern, difficult original words, or the sentences themselves. The reason lies in the operating principle of MT itself: phrase-based methods are used to translate from the original into another language, but Thai poetry can be written in incomplete sentences. Moreover, Thai words, especially words in poetry, are very complex; some words should be translated from Thai to Thai before they can be sent to the MT. The reason why poets use more difficult words is a matter of feeling, the beauty of these words, and the beauty of the poetry itself. The results in this paper show that the error percentage is very high when only MT is used to translate poetry, especially the rhyme error. It is, however, possible to decrease the error rate down to 60% by tuning the results of the MT. Moreover, the errors caused by words out of vocabulary could be decreased to between 1% and 2% with a backward Thai-to-Thai translation. Concerning the BLEU scores, in this paper we used only one reference for the evaluation; for BLEU, having many references is better than a single reference, but it is very difficult to find reliable references for such an evaluation, other than verified English translations from the owner of the original Thai poetry.
This paper is the first research dealing with machine translation of Thai poetry into English. In the future we hope to establish rules and poetry patterns to use in combination with MT to translate Thai poetry into English while keeping the prosody. The prosody and the meaning of the poetry are very important when translating into other languages, because they present the arts and culture of the country.
ACKNOWLEDGMENT
This work is supported with poetry for translation by The Contemporary Poet Association and Professor Srisurang Poolthupya. Thanks also go to Google and Bing, the owners of the machine translators used.
REFERENCES
[1] "Poetry", How the Language Really Works: The Fundamentals of Critical Reading and Effective Writing, [online], available: http://www.criticalreading.com/poetry.htm
[2] Ylva Mazetti, "Poetry Is What Gets Lost In Translation", [online], available: http://www.squidproject.net/pdf/09_Mazetti_Poetry.pdf
[3] P.E.N. International Thailand-Centre Under the Royal Patronage of H.M. The King, Anusorn Sunthorn Phu 200 Years, Amarin Printing, 2529, ISBN 974-87416-1-3.
[4] Tumtavitikul, Apiluck, "Thai Poetry: A Metrical Analysis", Essays in Tai Linguistics, M.R. Kalaya Tingsabadh and Arthur S. Abramson, eds., Bangkok: Chulalongkorn University, 2001, pp. 29-40.
[5] Google Code, Google Translate API v2, [online], available: http://code.google.com/apis/language/translate/overview.html
[6] Bing Translator, [online], available: http://www.microsofttranslator.com/
[7] Srisurang Poolthupya, Sakura, Taj Mahal, Bangkok, Thailand, 2010, pp. 1-2.
[8] Lixin Wang, Dan Yang, Junguo Zhu, "A Study of Computer Aided Poem Translation Appreciation", Second International Symposium on Knowledge Acquisition and Modeling, 2009.
[9] Dmitriy Genzel, Jakob Uszkoreit, Franz Och, "Poetic Statistical Machine Translation: Rhyme and Meter", Google, Inc., Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, USA, 2010, pp. 158-166.
[10] Erica Greene, Tugba Bodrumlu, Kevin Knight, "Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation", Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, MIT, Massachusetts, USA, 9-11 October 2010, pp. 524-533.
[11] Syllable rule, [online], available: http://www.phonicsontheweb.com/syllables.php
[12] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation", Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.
[13] L. Balasundararaman, S. Ishwar, S.K. Ravindranath, "Context Free Grammar for Natural Language Constructs - an Implementation for Venpa Class of Tamil Poetry", in Proceedings of Tamil Internet, India, 2003.
[14] Martin Tsan Wong and Andy Hon Wai Chun, "Automatic Haiku Generation Using VSM", 7th WSEAS Int. Conf. on Applied Computer & Applied Computational Science (ACACOS '08), Hangzhou, China, April 6-8, 2008.

Keyword Recommendation for Academic Publications using Flexible N-gram


Rugpong Grachangpun1, Maleerat Sodanil2
Faculty of Information Technology King Mongkut's University of Technology North Bangkok Bangkok, Thailand rugpong.g@gmail.com1, msn@kmutnb.ac.th2

Choochart Haruechaiyasak
Human Language Technology (HLT) Laboratory National Electronics and Computer Technology Center Pathumthani, Thailand choochart.haruechaiyasak@nectec.or.th

Abstract- This paper presents a method of keyword/keyphrase recommendation (annotation) for academic literature. The proposed method is flexible in that it generates phrases of flexible length (a flexible n-gram of keywords/keyphrases) to increase the chance of accurate and descriptive results. Several techniques are applied, such as part-of-speech tagging (POS tagging), term co-occurrence measured by the correlation coefficient, term frequency-inverse document frequency (TF-IDF), and finally weighting techniques. The results of the experiment were found to be very interesting. Moreover, comparisons against other keyword/keyphrase extraction algorithms were also investigated by the author.
Keywords- keyword recommendation; flexible N-gram; information retrieval; POS tagging

I. INTRODUCTION

At present, in the age of Information Technology, academic literature from many fields is published frequently and offered to readers via the Internet. Searching for a desired document can therefore be difficult due to the large volume of literature. If there were reliable ways for the information in these documents to be condensed into accurate keywords and keyphrases showing the main idea or overall picture of a document, it would be easier for readers to select the particular document they need. Keywords or keyphrases (combinations of multiple words) are able to tell readers quickly what a document is basically about. Keywords/keyphrases do not only briefly tell readers the main idea of a document; they also help people who work with documents professionally. For example, a librarian may take a long time to group enormous numbers of documents and arrange them on a shelf or in a database, so keywords/keyphrases can be used as part of a tool to classify those documents into groups. Automatically extracting keywords/keyphrases from a document is challenging because it means working with natural language, along with some other issues that we will cover later. Over the last decade, there have been several studies proposing methods of keyword/keyphrase annotation, and of extraction from different kinds of written media such as web pages, political records, etc. Several models have been used to cope with these tasks, such as fuzzy logic, neural network models and others [1,2,3,6]. But these two kinds of models suffer from the same disadvantages: lack of speed, difficulty of reuse, and accuracy that is quite low compared to statistical models. Statistical models are also widely used in this area; mathematical and statistical data or term properties provide reliable and accurate results based on frequency and location. This study relies not only on statistical techniques to solve this problem but also combines them with a Natural Language Processing technique called part-of-speech tagging (POST) [4] in order to filter the answer set generated by our algorithm. The main objective of this experiment is to extract a set of keywords/keyphrases whose length is not firmly fixed in the results. This paper is arranged as follows: Section II describes works that are related and have contributed to our experiment, Section III describes the proposed framework, techniques and methodology used in this experiment, Section IV focuses on the results and evaluation, and finally Section V closes with a conclusion and future work.
II. RELATED WORKS

In this section, previous related works that were helpful to our experiment are described.
A. Term Frequency-Inverse Document Frequency (TF-IDF)
Traditionally, TF-IDF is used to measure a term's importance by focusing on the frequency with which the term appears in a topical document and in the corpus. TF-IDF can be computed by (1) [5,7].
TF-IDF = [fre(p,d)/n] [-log2 (dfp/N)] (1)

Herein, fre(p,d) is the number of times phrase p occurs in document d; n is the number of words in document d; dfp is the number of documents in our corpus that contain
the topical phrase p; and N is the total number of documents in our corpus.
B. Correlation Coefficient (r)
The correlation coefficient is a statistical technique used to measure the degree of relationship between a pair of objects or variables, here a pair of terms. The correlation coefficient can be represented as R or r, and it can be computed by (2):

r = [nΣxy - (Σx)(Σy)] / √{[nΣx² - (Σx)²] [nΣy² - (Σy)²]}   (2)

where n is the total number of pairs of words, and x and y are the counts of elements x and y respectively. The value of r lies in the range -1 to 1. When r is close to 1, the relationship between the elements is very tight. When r is positive, x and y have a linearly proportional positive relationship, for example y increases when x increases. When r is negative, x and y have a linear negative relationship, for example y decreases when x increases. When r is 0 or close to it, x and y are not related to each other. Usually, in statistical theory, the relationship between two variables is considered strong when r is greater than 0.8 and weak when it is less than 0.5 [7]. In our experiment, the correlation coefficient is used to measure the degree of relationship in order to form a correct phrase. Section IV describes the best value of r and where it is applied in our experiment.
C. Part-of-Speech Tagging (POST)
Part-of-speech tagging (POST) is a technique or process used in Natural Language Processing (NLP) [4]. POST is also called grammatical tagging or word-category disambiguation [8]. The process is used to identify word functions such as noun, verb, adjective, adverb, determiner, and so on [9]. Knowing word functions helps us to form an accurate machine-generated phrase that is readable and understandable by a human. There are two different kinds of tagger [4]: the first is a stochastic tagger, which uses statistical techniques, and the second is a rule-based tagger, which focuses on the surrounding words to find a tag and assigns a word function to each term. Rule-based tagging seems to be better than the alternative because a word can have many functions, which also affects the word's meaning. In this study, rule-based tagging is applied to our experiment: the algorithm extracts the part-of-speech patterns (POS patterns) by running all keywords from the training documents through the POS tagger, and we then collect those patterns and put them in our repository.
D. Performance Measurement
The performance is measured with three parameters: Precision (P), Recall (R) and the F score or harmonic mean. These parameters are widely used in the study of Information Extraction. Precision tells us how well the algorithm found the right answers, Recall tells us how well it picked the right answers, and the F score measures the balance of Precision and Recall. All of the above can be calculated by the following equations.

P = #Correct Extracted Words/Phrases / #Retrieved Words/Phrases   (3)

R = #Correct Extracted Words/Phrases / #Relevant Words/Phrases   (4)

F = 2PR / (P + R)   (5)
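To illustrate how the correlation coefficient of equation (2) is later used (Section III-E) to decide whether two overlapping bi-grams such as "natural language" and "language processing" should be joined into one tri-gram, a small sketch follows. The per-document count vectors, the data layout, and the function names are assumptions; the 0.2 threshold is the value reported in Section IV.

```python
# Pearson correlation (equation (2)) between two word count vectors,
# used to decide whether overlapping bi-grams may be concatenated.
from math import sqrt

def correlation(x, y):
    """x, y: per-document occurrence counts of two terms (same length)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    denom = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return 0.0 if denom == 0 else (n * sxy - sx * sy) / denom

def join_bigrams(bigram1, bigram2, counts, threshold=0.2):
    """Concatenate e.g. ('natural', 'language') and ('language', 'processing')
    into ('natural', 'language', 'processing') when both word pairs correlate."""
    if bigram1[1] != bigram2[0]:
        return None                              # no shared joint word
    r1 = correlation(counts[bigram1[0]], counts[bigram1[1]])
    r2 = correlation(counts[bigram2[0]], counts[bigram2[1]])
    if r1 > threshold and r2 > threshold:
        return (bigram1[0], bigram1[1], bigram2[1])
    return None
```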

III. PROPOSED FRAMEWORK

In this experiment, the algorithm has two parts: a training phase, which creates the n-gram language model and extracts the POS patterns, and a testing phase, which extracts candidate phrases from a new document and calculates the degree of phrase importance.

FIGURE 1. Proposed framework

A. Preprocessing
Our experiment focuses on academic literature, so the source of raw documents must come from this area. All the data used comes from academic literature downloaded from IEEE and SpringerLink. After the documents are collected, they must be transformed into .txt format, which is the task of the Preprocessing step. Normally the raw documents are in .pdf format and need to be converted into .txt for convenient processing. Raw documents are sectioned into several major parts such as title, abstract, keywords and conclusion. Three of them are required in this experiment: title, abstract and conclusion; all words from those sections are collected. In the preprocessing step the two units (training and testing) are very similar: they only extract the raw content text from the three sections, and nothing more is done in this training phase.
B. N-Gram Process
In this paper, we focus on two major techniques. Uni-gram extraction is a simple step that extracts all words which do not appear in the stopword list. Secondly, the bi-gram list extracts all possible phrases which do not begin or end with stopwords. In each list, additional fields are
added in order to increase the speed of processing: the number of documents that contain the word/phrase and the number of times the word/phrase occurs in the corpus. The criteria for tokenizing possible words/phrases are listed below (a concrete sketch follows the list):
- A pair of words is not considered a phrase when the words are divided by a punctuation mark (full stop, comma, colon, semi-colon, etc.).
- Digits are ignored: any number is treated as a stopword.
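A minimal sketch of the uni-gram/bi-gram extraction just described follows; the stopword list, the tokenizer, and the punctuation handling are assumptions, but it respects the stated criteria (no phrase across punctuation, no phrase starting or ending with a stopword, digits treated as stopwords).

```python
# Uni-gram and bi-gram candidate extraction following the stated criteria.
import re

PUNCT = {".", ",", ":", ";", "?", "!"}
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are"}  # assumed list

def is_stop(token):
    return token.lower() in STOPWORDS or token.isdigit()   # digits count as stopwords

def candidates(text):
    tokens = re.findall(r"[A-Za-z]+|\d+|[.,:;?!]", text)
    unigrams, bigrams = set(), set()
    for i, tok in enumerate(tokens):
        if tok in PUNCT or is_stop(tok):
            continue
        unigrams.add(tok.lower())
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # a bi-gram must not cross punctuation or start/end with a stopword
        if nxt and nxt not in PUNCT and not is_stop(nxt):
            bigrams.add((tok.lower(), nxt.lower()))
    return unigrams, bigrams

# Example: candidates("Web search engines are the most visible IR application.")
# yields bi-grams such as ("web", "search"), ("search", "engines"), ("ir", "application")
```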

Where DI is the degree of importance of each phrase; fre(p,d) is the frequency of phrase p in document d; dfp is the number of documents in the corpus that contain phrase p; AW is an integer weight assigned to each section of the document; and IDF stands for Inverse Document Frequency, which can be computed by the second term of (1). AW is assigned arbitrarily and adjusted until it generates the best result. The strategy for assigning the weights is to focus intuitively on both the physical and logical characteristics of each section, such as its size and the likelihood that it contains important information. For instance, the sizes of the sections covered in this experiment are obviously different; the title is the smallest but certainly contains the most important information of a paper. AW is therefore assigned to each section and adjusted until it generates the best result. The experiment provides the best result when the Title, Abstract and Conclusion sections are weighted at 7, 2 and 1 respectively. The DI of every phrase is computed in each section and then, finally, the average value is computed. For example, for a phrase "information retrieval" that occurs 1, 3 and 2 times in the three sections of a document, DI is computed as shown in (7):

DI = IDF × {(1² × 7) + (3² × 2) + (2² × 1)} / 3   (7)

C. Candidate Phrases Extraction
All phrases from a converted document are extracted as bi-grams. The bi-gram tokenization process is similar to the n-gram processing from the training phase, but some of the criteria are different: words/phrases that appear only once are ignored [5], removed, and replaced by a punctuation mark in order to tokenize the remaining phrase correctly, and tokenization across punctuation marks is not allowed. The reason for tokenizing all phrases as bi-grams is that most keywords/keyphrases are already composed of bi-grams. Another reason is that a phrase tokenized as a tri-gram is inconvenient to shorten or lengthen compared with one tokenized as a uni-gram. From our literature review, the ratio of uni-grams to bi-grams to tri-grams is 1:6:3. Table I shows an example of tokenization.
TABLE I. EXAMPLE OF TOKENIZATION

DI =

The digit 3, at divisor, is the number of document section which are mentioned in this experiment. E. Phrase Filtration This process is about the filter before releasing the final result as keywords/keyphrase which is recommended by the algorithm. The result from the previous step may be lengthened as a tri-gram (the length is maximum in this experiment). As all phrases are computed in the previous processes into bi-grams, there might be some phrases which are not correct because there is a probability of a word missing. When a word is added in to these phrases, it should become more descriptive. For example, an expected phrase is natural language processing but the phrases in our list are natural language and language processing. In this case, we may concatenate those phrases by focusing on language as a joint. In the example just mentioned, the proposed technique worked properly but may not for other pairs of phrases. The Correlation Coefficient which is a statistic technique was applied to solve this problem in order to concatenate two phrases which have an identical joint instead of [12]. After that, some of the improper phrases may still remain due to improper arrangement. Thus, POS patterns obtained in the previous process are applied. We need to compare functions of each word in each phrase thats generated previously from our algorithm to patterns extracted from the corpus. We also have to focus on the subset of a word function. For example, the phrase of multiple compound sentences has a pattern, POS, as JJ, NN, NNS (adjective, singular noun, plural noun) on the other hand, our algorithm generates

Example Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR application. universities, public, libraries, IR, system, books, journals, document, web, search, engines, visible, IR and applications public library, IR systems, web search, search engine, visible IR, IR applications

Original

Uni-gram

Bi-gram

D. Weight Calculation and Ranking Weight calculation is used to score each phrase in our list which is called Rank. While the experiment was being conducted, (1) was applied to indentify words/phrases in the document but it did not generate well-enough results. Thus, (1) should be modified to gain a better end result. In this experiment, there are two parameters added, those are Area Weight (AW) and Word Frequency (f). f is term/phrase frequency in a document. Thus, (1) is modified as (6). DI = (fre(p,d))2
n IDF AW

(6)

72

The Eighth International Conference on Computing and Information Technology

IC2IT 2012

multiple compound sentence which pattern is JJ, NN, NN (adjective, singular noun, singular noun) thus, those phrases are identical. In the case of a word function position is swapped, the phrase is discarded. IV. EXPERIMENTAL RESULT
F measure

Precision and Recall when CC and POS applied 60.00 55.00 50.00 45.00 40.00 35.00 30.00 25.00 20.00 15.00 10.00

In our experiment, the algorithm behavior and its result were observed, the best value of r is found at 0.2. Which means if r of natural and language and also r of language and processing, from the example described above, is greater than 0.2, those phrase are concatenated as natural language processing. The preset of r value in this experiment being lower than the general statistic theory proposed in section II B, could be due to the data set in the corpus being scattered. For instance, the phrase natural language occurs 31 times from 18 documents while natural occurs 28 times from 27 documents and 203 times from 53 documents for the word language. This algorithm was trained by 400 documents in the corpus and was applied to 30 academic literatures which were randomly downloaded from the same source of the set data used in the training phase. All literatures were also converted to .txt format before processing. Our algorithm is measure at different amounts of extracted phrases, at 1-10, 15 and at best (best means all phrases in proposed list are mentioned, precision and recall is calculate from the last matched phrase in the proposed answer list). In Fig. 2, the performance measurement of Recall referring to both Correlation Coefficient (CC) and POS pattern (POS) are shown and compared to the Correlation Coefficient application.

1 2 3 4 5 6 7 P 57.69 40.38 32.05 30.77 28.46 25.64 23.08 R 17.69 25.10 28.43 35.20 40.97 43.66 45.26

Figure 3. Deep detail measurement of Precision and Recall

In this paper, the author also presents the best number of phrases that the algorithm should propose, maximize spot. The best number of phrases is calculated by the F score. Considering figure 3 and 4, the algorithm is suitable to propose no less than 5 phrases to end-user.
Maximize Spot

35.00 33.00
Percentage

31.00 29.00 27.00 25.00


1 2 3 4 5 6 7

70.00 60.00 50.00 40.00 30.00 20.00 10.00 0.00 R_1 P_1 R_2 P_2

Percentage

F-Score 27.08 30.96 30.13 32.83 33.59 32.31 30.57

Figure 4. Maximize Spot


5 29.26 20.77 40.97 28.46 10 42.41 15.77 49.82 18.46 15 47.57 12.05 56.14 14.36 at best 59.28 34.19 60.11 39.62

Finally, our algorithm is compared to a standard method of keyword extraction, TF-IDF, meaning that the degree of importance of each term was calculated by (1). The result is showed in table II.
TABLE II. PERFORMANCE COMPARISON

Figure 2. Performance comparison with and without POST.

R_1, P_1 represents Recall and Precision when the Correlation Coefficient Technique is applied. R_2, P_2 represents Recall and Precision when the Correlation Coefficient and POST are applied.

Standard TF-IDF Proposed method

Average Performance (%) Recall Precision F score 47.53 14.37 22.07 60.11 39.62 47.83

Table III and IV shows an example of keyphrases from the proposed method.

73

The Eighth International Conference on Computing and Information Technology

IC2IT 2012

TABLE III. EXAMPLE OF PROPOSED RESULT 1

Literature title Original Keywords

Concept Detection and Keyframe Extraction Using a Visual Thesaurus Concept Detection, Keyframe Extraction, Visual thesaurus, region types Visual Thesaurus model vector

achieve this result at 47.83% of precision. Furthermore, this study focuses on applying this method to develop an application for real situations. Therefore, the proposed model is built as simple as possible. The only disadvantage of the corpus is that the data set is not clustered in a unique narrow dimension but a broad one. But, the broad dimension of the dataset makes training sets rather natural and close to ordinary language which is the biggest advantage. Due to this experiment still in progress, there are some tasks that need to be revised. The author is planning to expand the size of corpus in term of the number of document in training set and to cover other fields of educational literature in order to observe a higher number of end results. ACKNOWLEDGMENT The authors would like to thank Asst.Prof. Dr.Supot Nitsuwat for sharing good ideas and his consultations. Dr.Gareth Clayton and Dr.Sageemas Na Wichian for statistical techniques and their experiences. Mr. Ian Barber for POST tool contribution and Acting Sub Lt. Nareudon Khajohnvudtitagoon for his development techniques. Last but not least, the faculty of Information Technology at King Mongkuts University of Technology. REFERENCES
[1] Md. R. Islam and Md. R. Islam, "An Improved Keyword Extraction Method Using Graph Based Random Walk Model," 11th Int. Conference on Computer and Information Technology, pp. 225-299, 2008.
[2] Z. Qingguo and Z. Chengzh, "Automatic Chinese Keyword Extraction Based on KNN for Implicit Subject Extraction," Int. Symposium on Knowledge Acquisition and Modeling, pp. 689-602, 2008.
[3] H. Liyanage and G.E.M.D.C. Bandara, "Macro-Clustering: improved information retrieval using fuzzy logic," Proc. of the 2004 IEEE Int. Symposium on Intelligent Control, pp. 413-418, 2004.
[4] E. Brill, "A simple rule-based part of speech tagger," Proc. of the third conference on Applied natural language processing, pp. 152-155, 1992.
[5] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin and C. G. Nevill-Manning, "KEA: Practical Automatic Keyphrase Extraction," Proc. of the fourth ACM conference on Digital libraries, ACM, 1999.
[6] K. Sarkar, M. Nasipuri and S. Ghose, "A New Approach to Keyphrase Extraction Using Neural Networks," Int. Journal of Computer Science Issues, vol. 7, issue 2, 2010.
[7] Mathbits.com, Correlation Coefficient. Available at: http://mathbits.com/mathbits/tisection/statistics2/correlation.htm
[8] Wikipedia, Part-of-Speech Tagging. Available at: http://en.wikipedia.org/wiki/Part-of-speech_tagging
[9] University of Pennsylvania, Alphabetical list of part-of-speech tags used in the Penn Treebank project. Available at: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
[10] X. Hu and B. Wu, "Automatic Keyword Extraction Using Linguistic Features," Sixth IEEE International Conference on Data Mining Workshops (ICDMW 06), pp. 19-23, 2006.

top 7 Keywords Proposed in bi-gram

Concept Detection region types keyframe extraction detection performance vector representation

top 3 Keywords Proposed in tri-gram

Concept detection performance shot detection scheme exploiting laent relations

TABLE IV. EXAMPLE OF PROPOSED RESULT 2

Literature title Original Keywords

An Improved Keyword Extraction Method Using Graph Based Random Walk Model keyword extraction, random walk model, mutual information, term Position, information gain Keyword Extraction improved keyword information gain extraction method method using mutual information extraction using

top 7 Keywords Proposed in bi-gram

top 3 Keywords Proposed in tri-gram V.

Random walk model mutual information gain using inspect benchmark DISCUSSION AND FUTURE WORK

This paper proposes an algorithm thats able to extract phrases that match more than half of the original keyphrases which are assigned by the author, meaning the result is determined as acceptable. Moreover, it uses less training sets to

Using Example-based Machine Translation for English Vietnamese Translation


Minh Quang Nguyen, Dang Hung Tran and Thi Anh Le Pham
Software Engineering Department, Faculty of Information Technology, Hanoi National University of Education {quangnm, hungtd, lepta}@hnue.edu.vn
Abstract- Recently, there have been significant advances in machine translation in Vietnam. Most approaches are based on the combination of grammar analysis with a rule-based or a statistics-based method. However, their results are still far from human expectations. In this paper, we introduce a new approach which uses example-based machine translation. The idea of this method is to use aligned pairs of Vietnamese and English sentences and an algorithm that retrieves, from the data resource, the English sentence most similar to the input sentence; we then produce a translation from the retrieved sentence. We applied the method to English-Vietnamese translation using a bilingual corpus of 6,000 sentence pairs. The system achieves feasible translation ability and also a good performance. Compared to other methods applied to English-Vietnamese translation, our method can reach a higher translation quality.
I. Introduction
Machine translation has been studied and developed for many decades. For Vietnamese, there are some projects which have proposed several approaches. Most approaches used a system based on analyzing and reflecting grammar structure (e.g. rule-based and corpora-based approaches). Among them, the rule-based approach is the main direction in this field nowadays, with a bilingual corpus and carefully built grammatical rules [7]. One of the biggest difficulties in rule-based translation, as well as in other methods, is data resources. An important resource required for translation is the thesaurus, which needs a lot of effort and work to build [9]; this dataset, however, does not meet the requirements yet. In addition, almost all traditional methods also require knowledge about the languages involved, so it takes time to build a system for new languages [5, 6]. Example-Based Machine Translation (EBMT) is a newer method which relies on large corpora and tries, to some extent, to reject traditional linguistic notions [5]. EBMT systems are attractive in that their output translations should be more sensitive to context than those of rule-based systems, i.e. of higher quality in appropriateness and idiomaticity. Moreover, EBMT requires a minimum of prior knowledge beyond the corpora that make up the example set, and is therefore quickly adapted to many language pairs [5]. EBMT has been applied successfully in Japan and America in some specific fields [1]; in Japanese, a system was built that achieves high-quality translation and efficient processing for travel expressions [1]. For Vietnamese, however, there is no research following this method, although applying it to English-Vietnamese translation does not require too many resources or too much linguistic knowledge. We only have an English-Vietnamese corpus data set at Ho Chi Minh National University, a significant data resource with 40,000 sentence pairs (in Vietnamese and English) and about 5,500,000 words [8]. We already have an English thesaurus and an English-Vietnamese dictionary, and for the set of aligned corpora we have prepared 5,500 items for this research. In this paper, we use EBMT knowledge to build a system for English-Vietnamese translation. We apply a graph-based method [1] to the Vietnamese language. In this paradigm, we have a set in which each item is a pair of two sentences: one in the source language and one in the target language.
From an input sentence, we retrieve from the set the item whose sentence is most similar to the input. Finally, from the example and the input sentence, we adjust the example to produce a final sentence in the target language. Unfortunately, we do not have a Vietnamese thesaurus, so we propose some solutions for this problem. In addition, this paper proposes a method to adapt the example sentence to produce the final translation.

1. EBMT overview:

There are three components in a conventional example-based system:
- Matching fragment component.
- Word alignment.
- Combination of the input and the retrieved example sentence to produce the final target sentence.
For example:


(1) He buys a book on international politics.
(2) a. He buys a notebook.
    b. Anh ấy mua một quyển sách.
(3) Anh ấy mua một quyển sách về chính trị thế giới.

With the input sentence (1), the translation (3) can be produced by matching and adapting from (2a, b). One of the main advantages of this method is that we can easily improve the translation quality by widening the example set: the more items we add, the better the results. It is useful for specific domains because of the limited sentence forms found in those domains; for example, we can use it to translate product manuals, weather forecasts, or medical diagnoses. The difficulty of applying EBMT to Vietnamese is that there is no Vietnamese WordNet, so we propose some new solutions to this problem. We build a system with three steps:
- Form the set of example sentences; the result is a set of graphs.
- Retrieve the example sentence most similar to the input sentence. From an input sentence, using edit-distance measuring, the system finds the sentences that are most similar to it. Edit distance is used for fast approximate matching between sentences: the smaller the distance, the greater the similarity between sentences.
- Adjust the gap between the example and the input.

2. Data resources:

We use three data resources:
- Bilingual corpus: the set of example sentences. This set consists of sentence pairs, and each sentence is represented as a word sequence. Enlarging this set improves the quality of translation.
- Thesaurus: a list of words showing similarities, differences, dependencies, and other relationships to each other.
- Bilingual dictionary: we used the popular English-Vietnamese dictionary file provided by Socbay Company.

3. Build the graph of the example set.

The sentences are word sequences. We divide the words into two groups:
- Functional words: functional words (or grammatical words) are words that have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence, or specify the attitude or mood of the speaker.
- Content words: words that are not function words are called content words (or open-class words or lexical words); these include nouns, verbs, adjectives, and most adverbs, although some adverbs are function words (e.g., "then" and "why").

We classify the set into subsets. Each subset contains sentences with the same number of content words and the same number of functional words. Based on this division, we build a group of word graphs:
- They are directed graphs with a start node and a goal node. They consist of nodes and edges; each edge is labeled with a word and has its own source node and destination node.
- Each graph represents one subset, i.e. the sentences with the same total of content words and the same total of functional words.
- Each path from the start node to the goal node represents a candidate sentence.
To optimize the system, we have to minimize the size of each word graph: common word sequences in different sentences share the same edges. A simple sketch of this grouping and graph construction is given below.
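The following is a minimal Python sketch of how the example set could be grouped and turned into word graphs as described above. The function-word list, the tokenizer, and the prefix-sharing construction are simplifications introduced here for illustration; the paper's actual graphs are minimized with finite-state-automaton techniques [3, 4], which this sketch does not implement.

```python
from collections import defaultdict

# A small, illustrative function-word list (the real system would use a fuller one).
FUNCTION_WORDS = {"a", "an", "the", "of", "on", "in", "by", "to", "has", "been", "is"}

def counts(sentence):
    """Return (number of content words, number of function words) for a sentence."""
    words = sentence.lower().split()
    n_func = sum(1 for w in words if w in FUNCTION_WORDS)
    return (len(words) - n_func, n_func)

def build_word_graphs(sentences):
    """Group sentences by (content, function) word counts and build one
    prefix-sharing word graph per group. Each graph maps a node id to a
    dict of outgoing edges {word: next_node}; node 0 is the start node."""
    groups = defaultdict(list)
    for s in sentences:
        groups[counts(s)].append(s)

    graphs = {}
    for key, group in groups.items():
        graph = {0: {}}
        next_id = 1
        for s in group:
            node = 0
            for word in s.lower().split():
                # Reuse an existing edge when the word sequence is shared.
                if word not in graph[node]:
                    graph[node][word] = next_id
                    graph[next_id] = {}
                    next_id += 1
                node = graph[node][word]
        graphs[key] = graph
    return graphs

if __name__ == "__main__":
    examples = ["He buys a notebook", "He buys a book", "She closes the teary eyes"]
    for key, g in build_word_graphs(examples).items():
        print(key, g)
```

The two sentences sharing the prefix "He buys a" end up in the same group and reuse the same first three edges, which is the space saving the word graph is meant to provide.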

Figure 1: Example of a word graph.

The word graphs have to be optimized to have the minimum number of nodes; we use the method of converting finite state automata [3, 4]. After preparing all resources, we execute the two main steps of the method: example retrieval and adaptation.

4. Example retrieval:

We use the A* search algorithm to find the most similar sentences in the word graph. The result of matching two word sequences is a sequence of substitutions, deletions and insertions. The search process in a word graph finds the least distance between the input sentence and all the candidates represented in the graph. As a result, matching sequences along a path are obtained as records, each consisting of a label and one or two words:
- Exact match: E(word)
- Substitution: S(word, word)
- Deletion: D(word)
- Insertion: I(word)
For example, the matching sequence between the input sentence "We close the door" and the example "She closes the teary eyes" is:


S(She, We) E(close) E(the) I(teary) S(eyes).

The problem here is that we have to pick the sentence with the least distance to the input sentence. We first compare the total number of E-records in each matching sequence, then the number of S-records, and so on.

5. Adaptation:

From the retrieved example, we adapt it to produce the final sentence in the target language by insertions, substitutions and deletions. To find the meaning of English words, we use a morphological method.

5.1. Substitution, deletion and exact match:

For substitution, deletion and exact match we must find the right position for each word in the final sentence. With deletion, we do nothing, but for substitution records we still have to find the meaning of the word, and two problems arise:
- A word may have several different meanings; which one should be chosen?
- Words in the dictionary are all in their base form, while they can appear in many other forms in the input sentence.

We solve this problem in two stages. Firstly, we find the type of each word (noun, verb, adverb, ...) in the sentence; we use the Penn Treebank tagging system to specify the form of each word. Secondly, based on the form of the word, we look the word up in the dictionary.

If the word is a plural noun (NNS):
- If it ends with CHES, we try deleting ES and CHES; if the deletion yields a base form, we look up its meaning in the dictionary. Otherwise, it is treated as a specific noun.
- If it ends with XES or OES, we delete XES or OES and look up the meaning.
- If it ends with IES, we replace IES by Y.
- If it ends with VES, we replace VES by F or FE.
- If it ends with ES, we replace ES by S.
- If it ends with S, we delete S.
After finding the meaning of a plural noun, we add "những" before its meaning.

If the word is a gerund:
- Delete ING at the end of the word. We try two cases: first the word without ING, and second the word without ING and with IE at the end.

If the word is VBP:
- If the word is IS, it is TO BE.
- If it ends with IES, replace IES by Y.
- If it ends with SSES, erase ES.
- If it ends with S, erase S.

If the word is in the past participle or past form:
- Check whether the word is included in the list of irregular verbs. If it is, we use the base form to find the meaning. The list of irregular verbs is stored as a red-black tree to make the search easier and faster.
- If it ends with IED, erase IED.
- If it ends with ED, check the two letters just before ED; if they are identical, we erase the last three letters of the word, otherwise we erase ED.

If the word is in the present continuous form, we find the word in the same way as for gerunds and then add "đang" after the meaning.

If the word is JJS, delete the last three or four characters and find the meaning in the dictionary.

These rules are illustrated by the code sketch following Section 5.2. After the base form of a word is found, we use the bilingual dictionary to look up the meaning. The remaining problem is that, once we reach the base form, a word may still have many meanings, and we have to choose the right one. In our experiment, we take the first meaning listed in the bilingual dictionary.

5.2. Insertion:

The problem here is that we do not know the exact position at which to insert the Vietnamese meaning. If we simply use the position of the Insertion record in the matching sequence, the final Vietnamese sentence will be of low quality. We use some ideas from rule-based machine translation to solve this problem: for some specific phrases we can find a better position than the order of the records. Firstly, the Link Grammar system parses the grammatical structure of the sentence. The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.).

From the grammatical structure of the sentence, we find the English phrases whose word order needs to change when translated into Vietnamese. For example, for the noun phrase "nice book", with the two records I(nice) and I(book), we used to translate it as "hay quyển sách" instead of "quyển sách hay". With link grammar, we know the exact order for the translation. Some phrases to process are listed in Table 1.
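As a rough illustration of the suffix-stripping rules in Section 5.1, here is a minimal Python sketch. The rule set is simplified, and the dictionary below is a hypothetical stand-in for the English-Vietnamese dictionary file; this is not the system's actual implementation.

```python
def base_form(word, tag):
    """Reduce an inflected English word to candidate base forms,
    following simplified versions of the rules in Section 5.1."""
    w = word.lower()
    candidates = []
    if tag == "NNS":                       # plural nouns
        for suffix, repl in [("ches", "ch"), ("xes", "x"), ("oes", "o"),
                             ("ies", "y"), ("ves", "f"), ("es", "e"), ("s", "")]:
            if w.endswith(suffix):
                candidates.append(w[: len(w) - len(suffix)] + repl)
                break
    elif tag == "VBG":                     # gerund / present continuous
        if w.endswith("ing"):
            stem = w[:-3]
            candidates += [stem, stem + "ie"]
    elif tag in ("VBD", "VBN"):            # past / past participle
        if w.endswith("ied"):
            candidates.append(w[:-3] + "y")
        elif w.endswith("ed"):
            stem = w[:-2]
            # doubled final consonant, e.g. "stopped" -> "stop"
            candidates.append(stem[:-1] if len(stem) > 1 and stem[-1] == stem[-2] else stem)
    return candidates or [w]

# Hypothetical bilingual dictionary fragment; the real system uses the Socbay file.
DICTIONARY = {"book": "quyển sách", "buy": "mua", "close": "đóng"}

def translate_word(word, tag):
    for form in base_form(word, tag):
        if form in DICTIONARY:
            meaning = DICTIONARY[form]
            if tag == "NNS":
                meaning = "những " + meaning   # plural marker, as in Section 5.1
            return meaning
    return word  # fall back to the original word if no entry is found

print(translate_word("books", "NNS"))   # -> những quyển sách
```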


Table 1: Some phrases processed with Link Grammar

1  Noun phrase: POS(1, 2) = ({JJ}, {NN}). Reorder: ({NN}, {JJ})
2  Noun phrase: POS(1, 2, 3) = ({DT}, {JJ}, {NN}) and word1 in {this, that, these, those}. Reorder: ({NN}, {JJ}, {DT})
3  Noun phrase: POS(1, 2) = ({NN1}, {NN2}). Reorder: ({NN2}, {NN1})
4  Noun phrase: POS(1, 2) = ({PRP$}, {NN}). Reorder: ({NN}, {PRP$})
5  Noun phrase: POS(1, 2, 3) = ({JJ1}, {JJ2}, {NN}). Reorder: ({NN}, {JJ2}, {JJ1})

5.3. Example:

Input sentence: "This nice book has been bought".
Example retrieval: the most similar example to the input sentence is "This computer has been bought by him".
Sequence of records: E(This) I(nice) S(computer, book) E(has) E(been) E(bought) D(by) D(him).
With link grammar, there is a noun phrase within the sentence, "This nice book", corresponding to the records E(This), I(nice), S(computer, book). We reorder the sequence accordingly: S(computer, book) I(nice) E(This) E(has) E(been) E(bought) D(by) D(him).
Based on the new record sequence and the example, the adaptation phase proceeds as follows:
- Exact match: keep the order and the meaning of the word: "này được mua".
- Substitution: find the meaning of the word in the input sentence and replace the word in the example with it: "Quyển sách này được mua".
- Deletion: just erase the word in the example: "Quyển sách này được mua".
- Insertion: we now have the right order of records, so we just find the meaning of the word in the Insertion record and put it in the order given by the record sequence: "Quyển sách hay này được mua".
After these four adaptation steps, we obtain the final sentence: "Quyển sách hay này được mua".
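To make the record sequences of Sections 4 and 5.3 concrete, here is a minimal Python sketch of a word-level edit-distance alignment that emits E/S/D/I records between a retrieved example and an input sentence. It is a plain dynamic-programming alignment over two word sequences, not the paper's A* search over a word graph, and ties may be broken differently from the worked example above.

```python
def align(example, inp):
    """Word-level edit-distance alignment producing E/S/D/I records.
    Deletions and insertions are expressed relative to the example sentence."""
    m, n = len(example), len(inp)
    # dp[i][j] = minimal edit cost between example[:i] and inp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if example[i - 1] == inp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # match / substitution
                           dp[i - 1][j] + 1,        # deletion from the example
                           dp[i][j - 1] + 1)        # insertion into the example
    # Trace back to recover one minimal-cost record sequence.
    records, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (example[i - 1] != inp[j - 1]):
            if example[i - 1] == inp[j - 1]:
                records.append(f"E({inp[j - 1]})")
            else:
                records.append(f"S({example[i - 1]}, {inp[j - 1]})")
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            records.append(f"D({example[i - 1]})")
            i -= 1
        else:
            records.append(f"I({inp[j - 1]})")
            j -= 1
    return list(reversed(records))

print(align("This computer has been bought by him".split(),
            "This nice book has been bought".split()))
```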

6. Evaluation:

6.1. Experimental conditions:

We manually built an English-Vietnamese corpus of 6000 sentence pairs. To evaluate translation quality, we employed a subjective measure: each translation result was graded into one of four ranks by a bilingual human translator who is a native speaker of Vietnamese. The four ranks were:
A: Perfect. No problem with either grammar or information; translation quality is nearly equal to a human translator.
B: Fair. The translation is easy to understand but has some grammatical mistakes or misses some trivial information.
C: Acceptable. The translation is broken but can be understood with effort.
D: Nonsense. Important information was translated incorrectly.
The English-Vietnamese dictionary used includes 70,000 words. To optimize the processing time, a threshold is used to limit the result set of the example retrieval phase. Table 2 shows the thresholds used for sentences shorter than 30 words; if the length of the input sentence is greater than or equal to 30, the threshold is 8.

Table 2: Value of the threshold

Length of sentence (words)   Threshold
0 - 5                        2
5 - 10
10 - 15                      4
15 - 30                      6


6.2. Performance:

For the experiment, we created two test sets: a set of random sentences with complex grammatical structure and a set of 50 sentences edited from the training set. Under these conditions, the average processing time is less than 0.5 second for providing each translation. Although the processing time increases as the corpus size increases, the increase is not linear but roughly a half power (square root) of the corpus size. Compared to DP-matching [2], the retrieval method using a word graph and A* search achieves efficient processing. Using the threshold 0.2 with random sentences, the processing time is significantly decreased, but the translation quality is low. The reason is that the bilingual corpus we used is too small; as a result, the retrieved examples are not similar enough to the input sentence. There are two ways to increase translation quality: firstly, widen the example set; secondly, since we do not yet have an appropriate way to choose the right meaning from the bilingual dictionary, apply context-based translation to the EBMT system. Tables 3 and 4 illustrate the evaluation results.

Table 3: Set of edited sentences and performance

Rank   Total   Average length of sentence
A      25      9.3
B      11      6.3
C      3       7.8
D      11      8.4
Precision: 70%

Table 4: Set of random sentences and performance

Rank   Total   Average length of sentence
A      15      5.7
B      10      5.6
C      3       6.0
D      22      8
Precision: 50%

For the set of edited sentences (Table 3), the system reached high translation quality, with a precision of 70%. Items in this set have a grammar structure and word types similar to the example set, which lets the EBMT system find suitable sentences to translate from. For the set of random sentences (Table 4), which contains a number of complex sentences, the examples the EBMT system retrieves are not similar enough to the input; consequently, the result has low quality (only 50% of the sentences were translated with quality rank A or B, the rest being at rank C or D). The system can translate sentences with complex grammatical structure as long as the retrieved example is similar enough to the input sentence.

7. Conclusion:

We report on a retrieval method for an EBMT system using edit distance and an evaluation of its performance on our corpus. In the performance evaluation experiments, we used a bilingual corpus comprising 6000 sentences from various fields. The main reason for some low-quality translations is the small size of the bilingual corpus. The EBMT system will provide better performance when it is applied to a specific field, for example translating product manuals or introductions in the travel domain. Experimental results show that the EBMT system achieved feasible translation ability, and also achieved efficient processing by using the proposed retrieval method.

Acknowledgements: The authors' heartfelt thanks go to Professor Thu Huong Nguyen, Computer Science Department, Hanoi University of Science and Technology, for supporting the project, and to the Socbay linguistic specialists for providing resources and helping us to test the system.

References
[1] Takao Doi, Hirofumi Yamamoto and Eiichiro Sumita, 2005. Graph-based Retrieval for Example-based Machine Translation Using Edit-distance.
[2] Eiichiro Sumita, 2001. Example-based Machine Translation Using DP-matching Between Word Sequences.
[3] John Edward Hopcroft and Jeffrey Ullman, 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA.
[4] Janusz Antoni Brzozowski, 1962. Canonical Regular Expressions and Minimal State Graphs for Definite Events. Mathematical Theory of Automata, MRI Symposia Series, Polytechnic Press, Polytechnic Institute of Brooklyn, NY, 12, pp. 529-561.
[5] Steven S. Ngai and Randy B. Gullett, 2002. Example-Based Machine Translation: An Investigation.
[6] Ralf Brown, 1996. Example-Based Machine Translation in the PanGloss System. In Proceedings of the Sixteenth International Conference on Computational Linguistics, pp. 169-174, Copenhagen, Denmark.


[7] Michael Carl, 1999. Inducing Translation Templates for Example-Based Machine Translation. In Proceedings of MT-Summit VII, Singapore.
[8] Dinh Dien, 2002. Building a Training Corpus for Word Sense Disambiguation in English-to-Vietnamese Machine Translation.

[9] Chunyu Kit, Haihua Pan and Jonathan J. Webster, 1994. Example-Based Machine Translation: A New Paradigm.
[10] Kenji Imamura, Hideo Okuma, Taro Watanabe, and Eiichiro Sumita, 2004. Example-based Machine Translation Based on Syntactic Transfer with Statistical Models.


Cross-Ratio Analysis for Building up The Robustness of Document Image Watermark


Wiyada Yawai
Department of Computer Science, Faculty of Science, King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand. E-mail: yawai.wiyada@gmail.com

Nualsawat Hiransakolwong
Department of Computer Science, Faculty of Science, King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand. E-mail: khnualsa@kmitl.ac.th

Abstract: This research presents the cross-ratio theory applied to build up the robustness of invisible watermarks embedded in multi-language (English, Thai, Chinese, and Arabic) grayscale document images against geometric distortion attacks (scaling, rotating, shearing) and other manipulations such as noise addition, compression, sharpness, blur, and brightness and contrast adjustment, which occur while the embedded watermarks are scanned for verification. These attacks are simulated to test the effectiveness of the cross-ratio theory, used here for the first time to enhance the robustness of watermarks for document images of any language, which is normally the main limitation of other watermarking methods. The theory uses the four corners and the two diagonals of a document image as references for watermark embedding lines located between text lines, crossing the two diagonals and the vertical border lines on both sides according to specified cross-ratio values. The watermark embedding positions on each line are calculated from another set of cross-ratios, and the cross-ratio values of each line differ according to preset patterns. Detecting watermarks in document images requires neither converting the image nor comparing it with the original image: our approach detects through calculation from the four corners of the image and applies a correlation coefficient equation to compare directly against the original watermarks. Testing revealed that the method builds up reasonable robustness against scaling (from 11% up), shearing (0 - 0.05), rotating (1 - 4 degrees), compression (quality 60% up), contrast (1 - 45%), sharpness (0 - 100%) and blur filtering at mask sizes smaller than 13x13.

Keywords: Digital watermarking; Document image; Robustness; Geometric distortion; Cross ratio; Collinear points

I. INTRODUCTION

Digital watermarking is one of the processes of hiding data for protecting the copyright of digital media, whether audio, video, text, etc. There are two categories of watermarking: visible watermarking and invisible watermarking. The major purpose of watermarking is to protect the copyright of media by creating various forms of obstruction to violators. This research focuses in particular on applying the cross-ratio theory to create robust watermark data embedded in a grayscale document image, which must survive and be easily detected even after it has been attacked in many possible ways, especially by geometric distortion attacks, which have mostly not been explored in other document image watermarking research.

Most existing research focuses on watermarking an electronic text or document file instead of a document image, on one specified language instead of multiple languages, and on the watermark embedding technique instead of watermark robustness. This existing document watermarking research can be categorized, by watermarking technique, into three techniques as follows.

Technique I: Watermark embedding by rearranging the physical layout, pattern or structure of the text document, such as shifting of lines [1] and words, in particular manipulating word spacing, word shift coding or word classification [2][3][4][5]. This technique can be applied both to electronic document files and to document images. However, it has some disadvantages: for instance, the line shifting technique of Brassil et al. has low robustness to documents passing through document processing, to page skewing/rotation between -3 and +3 degrees, to noise adding attacks, and to short text lines. Another limitation of this approach is that it can only be applied to documents with spaces between words, spacing of letters, shifting of baselines, or line shift coding. A word shifting algorithm has also been developed by Huang et al. [5]; it adjusts inter-word spaces in a text document so that the mean spaces across different lines show the characteristics of a sine wave, in which information or a watermark can be encoded, for both the horizontal and vertical directions. Min Du et al. [6] proposed a text digital watermarking algorithm based on human visual redundancy: since the human eye is not sensitive to slight changes in text color, watermarks are embedded by changing the low 4 bits of the RGB color components of characters. This method has good invisibility and robustness, which depends on its redundancy; however, its robustness was tested only against word deleting and modifying.

Technique II: Embedding a text watermark by modifying character or letter features. For example, Brassil et al. [2] adjusted letters by reducing or increasing their length, such as increasing the length of the letters b, d or h. In principle, the hidden data is extracted from a document by comparing the document carrying the hidden data against the original document. The limitation of this approach is that the hidden data has very little robustness to the document passing through document processing. W. Zhang et al. [7] applied arithmetic expressions to replace characteristics of letters with close components (in square form), where the hiding is done by adjusting the sizes of those characters in the document file. This process is robust against attacks or destruction and is hard to observe; the tests indicated that the hiding is more durable and more difficult to observe than line-shift coding, word-shift coding, and character coding, but the paper does not present robustness testing information. Moreover, this approach applies only to Chinese characters and needs further research, since it was tested only against the character replacement attack and not against other forms of watermark attack. Shirali-Shahreza et al. [8] exploited changes to Persian characters, a number of which are distinguished by their points (e.g. the Persian letter NOON), for hiding. Due to the defects of OCR in reading Persian and Arabic document images, reading printed text from these characters to extract the hidden data is complicated, especially after attacks, which were not tested. Suganya et al. [9] proposed modifying perceptually significant portions of an image so that the watermark is hidden in the location of the points of the English letters i and j; the first few bits indicate the length of the hidden bits to be stored, then the cover text is scanned and, to store a one, the point is slightly shifted up, otherwise it remains unchanged. However, this research did not report any robustness testing results.

Technique III: Watermarking with semantic schemes or word/vowel substitution. Topkara et al. [10] developed a technique for embedding secret data without changing the meaning of the text by replacing words in the text with synonyms. This method deteriorates the quality of the document and requires a large synonym dictionary. Samphaiboon et al. [11] proposed a steganographic scheme for electronic Thai plain text documents that exploits redundancies in the way particular vowel, diacritical, and tonal symbols are composed in TIS-620, the standard Thai character set. The scheme is blind in that the original carrier text is not required for decoding. However, it can be used only with Thai text documents, and its watermark data is easily destroyed by re-editing with a word processing program.

The research presented here studies watermarking of text document images (not electronic document files), scanned or copied from an original paper document, by applying the cross-ratio theory of collinear points in order to build up the watermark's robustness against various forms of attack, particularly geometric distortions (scaling, shearing and rotation) and other manipulations (data compression, noise addition, and brightness, contrast, scale, sharpness and blur adjustment).

II. THE CROSS RATIO OF FOUR COLLINEAR POINTS

The cross-ratio is a basic invariant in projective geometry (i.e., all other projective invariants can be derived from it). Here a brief introduction to the cross-ratio invariance property is given. Let A, B, C, D be four collinear points (three or more points A, B, C, ... are said to be collinear if they lie on a single straight line [12]), as shown in Fig. 1. The cross-ratio is defined as the double ratio in Eq. (1):

    (A, B; C, D) = (AC · BD) / (BC · AD)                                    (1)

where all the segments are taken to be signed. The cross-ratio does not depend on the selected direction of the line ABCD, but does depend on the relative position of the points and the order in which they are listed. By a fundamental theorem, any homography preserves the cross-ratio; thus central projection, linear scaling, skewing, rotation, and translation preserve the cross-ratio [13].

Figure 1. Collinear points A, B, C, and D
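As a small illustration of this invariance property (not part of the paper's method), the following Python sketch computes the cross-ratio of four collinear points and checks that it is preserved under an arbitrary affine map of the plane; the specific points and transform are made up for the example.

```python
import numpy as np

def cross_ratio(a, b, c, d):
    """Cross-ratio (A, B; C, D) = (AC * BD) / (BC * AD) for collinear points
    given as 2D coordinates; signed lengths are taken along the line direction."""
    direction = (d - a) / np.linalg.norm(d - a)
    # Signed coordinate of each point along the line through A and D.
    t = lambda p: np.dot(p - a, direction)
    ac, bd, bc, ad = t(c) - t(a), t(d) - t(b), t(c) - t(b), t(d) - t(a)
    return (ac * bd) / (bc * ad)

# Four collinear points on the line y = 2x (made-up values).
A, B, C, D = (np.array([x, 2.0 * x]) for x in (0.0, 1.0, 3.0, 5.0))
cr_before = cross_ratio(A, B, C, D)

# Apply an arbitrary invertible transform (scale + shear + translation).
M = np.array([[1.4, 0.3], [0.2, 0.9]])
t_vec = np.array([10.0, -4.0])
A2, B2, C2, D2 = (M @ p + t_vec for p in (A, B, C, D))
cr_after = cross_ratio(A2, B2, C2, D2)

print(cr_before, cr_after)          # the two values agree (up to rounding)
assert abs(cr_before - cr_after) < 1e-9
```

This is why the same cross-ratio values can be recomputed from the distorted document image at detection time, as long as the four corner points can still be found.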

III. ANALYSIS OF DIGITAL WATERMARKING FOR DOCUMENT IMAGE

To apply the cross ratio to digital image watermarking, three reference points are required. In this section, a method for deriving such reference points is detailed.

A. Definition

CaCb is the line from an origin point Ca to a destination point Cb, where a = 1, 2, 3, 4 and b = 1, 2, 3, 4.

    Cr  = (CA/CD) : (BA/BD) = (CA/CD) · (BD/BA)
    R   = (AC/CD) / Cr
    BA  = (AD · R) / (1 + R)
    DsB is the distance ratio BA/AD (equal to the value of BA/DA).

B. Embedding Scheme

Let us start by considering the embedding part. The method is described algorithmically below.
1) Predefine the set of cross-ratio values to be used in subsequent steps.
2) Find the image center, denoted by Dc, by using the line intersection formula [14] applied to the two diagonal lines of the image, as described by Eqs. (2)-(3) below.


    xc = xt / xb                                                            (2)
    yc = yt / yb                                                            (3)

where

    xt = | a   x1 - x4 |        yt = | a   y1 - y4 |
         | b   x3 - x2 |,            | b   y3 - y2 |,

    xb = yb = | x1 - x4   y1 - y4 |
              | x3 - x2   y3 - y2 |,

    a = | x1   y1 |        b = | x3   y3 |
        | x4   y4 |,           | x2   y2 |,

(xi, yi) is the coordinate of the point Ci, i = 1, ..., 4 (see Fig. 2), and | . | denotes the determinant operator. From the above equations, xc is the x-axis value of the point Dc where the two diagonal lines intersect (C1C4 intersects C2C3), and yc is the y-axis value of the same point.

3) Find each of the primary-level watermark embedding points (DLU,i and DLD,i) on the left diagonal line (see Fig. 2(a)), as described by Eqs. (4)-(7) below. These points are identified by using the two corner points of the left diagonal line (C1 and C4), in combination with the image center point Dc and the predefined cross-ratio values (Cr):

    xLU,i = x1 + DsB (x4 - x1)                                              (4)
    yLU,i = y1 + DsB (y4 - y1)                                              (5)
    xLD,i = x1 + DsB (x4 - x1)                                              (6)
    yLD,i = y1 + DsB (y4 - y1)                                              (7)

where (xLU,i, yLU,i), i = 1, ..., MLU, is the coordinate of the point DLU,i, with A = C1, B = DLU,i, C = Dc and D = C4. In addition, (xLD,i, yLD,i), i = 1, ..., MLD, is the coordinate of the point DLD,i, with A = C1, B = DLD,i, C = Dc and D = C4.

Figure 2. Notations of the collinear points A, B, C, and D, defined in the cross-ratio equation, on the left (a) and right (b) diagonal lines of the document image.

4) Find each of the watermark embedding points (DRU,i and DRD,i) on the right diagonal line (see Fig. 2(b)) by following steps and equations similar to those detailed in Step 3. However, the point A in Eqs. (8)-(11) now represents the point C2 while the point D now represents the point C3. With these substitutions, the embedding points are given by

    xRU,i = x2 + DsB (x3 - x2)                                              (8)
    yRU,i = y2 + DsB (y3 - y2)                                              (9)
    xRD,i = x2 + DsB (x3 - x2)                                              (10)
    yRD,i = y2 + DsB (y3 - y2)                                              (11)

where (xRU,i, yRU,i), i = 1, ..., MRU, is the coordinate of the point DRU,i, with A = C2, B = DRU,i, C = Dc and D = C3. In addition, (xRD,i, yRD,i), i = 1, ..., MRD, is the coordinate of the point DRD,i, with A = C2, B = DRD,i, C = Dc and D = C3.

5) For each pair of levels DLU,i, DRU,i and DLD,i, DRD,i, find the intersection (xi, yi) of the crossed line of each level, drawn across the left side (LLU,1 ... LLD,1) and the right side (LRU,1 ... LRD,1) of the document image borders (see Fig. 3(a)), i.e. the border lines C1C3 and C2C4, by applying Eqs. (12)-(13):


    xi = xt / xb                                                            (12)
    yi = yt / yb                                                            (13)

where

    xt = | a   x1 - x2 |        yt = | a   y1 - y2 |
         | b   x3 - x4 |,            | b   y3 - y4 |,

    xb = yb = | x1 - x2   y1 - y2 |
              | x3 - x4   y3 - y4 |,

    a = | x1   y1 |        b = | x3   y3 |
        | x2   y2 |,           | x4   y4 |.

6) Find each of the watermark embedding points (EHU,i,k and EHD,i,k) on the watermark embedding lines (see Fig. 3(b)). Eqs. (14)-(17) give these embedding points:

    xHU,i,k = xLU,i + DsB (xRU,i - xLU,i)                                   (14)
    yHU,i,k = yLU,i + DsB (yRU,i - yLU,i)                                   (15)
    xHD,i,k = xLD,i + DsB (xRD,i - xLD,i)                                   (16)
    yHD,i,k = yLD,i + DsB (yRD,i - yLD,i)                                   (17)

where (xLU,i, yLU,i), i = 1, ..., MLU, with A = LLU,i, B = EHU,i,k, C = DRU,i and D = LRU,i. In addition, (xLD,i, yLD,i), i = 1, ..., MLD, with A = LLD,i, B = EHD,i,k, C = DRD,i and D = LRD,i.

Figure 3. (a) Notations of the horizontal lines intersecting the two diagonal lines and the left and right border lines of the text document image. (b) Notations of the collinear points used for embedding 20 invisible actual watermark pattern bits in document images of the English, Thai, Chinese, and Arabic languages.

7) From all watermark embedding points, embed the watermark patterns by means of a spread-spectrum principle [15] using the following equations. Given the set of watermark embedding points Ek = (xk, yk), k = 1, ..., M, and the watermarking pattern bits wk, wk in {1, -1}, k = 1, ..., M, each watermarking pattern bit is embedded into the original image by Eq. (18):

    Ie(x_m^k, y_n^k) = I(x_m^k, y_n^k) + α · wk                             (18)

where x_m^k = xk + m, m = -P, ..., P; y_n^k = yk + n, n = -Q, ..., Q; and α is the strength of the watermark.
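For illustration only, the sketch below shows how the geometric part of the embedding scheme could be computed: the diagonal intersection of Eqs. (2)-(3) and points at a given distance ratio along a segment in the spirit of Eqs. (4)-(11) and (18). The corner coordinates, DsB ratios, watermark bits and image are made-up placeholders, and the simple per-pixel additive step here does not reproduce the paper's full spread-spectrum block embedding.

```python
import numpy as np

def line_intersection(p1, p4, p2, p3):
    """Intersection of lines C1C4 and C2C3 (Eqs. (2)-(3)), via determinants."""
    (x1, y1), (x4, y4), (x2, y2), (x3, y3) = p1, p4, p2, p3
    a = np.linalg.det([[x1, y1], [x4, y4]])
    b = np.linalg.det([[x3, y3], [x2, y2]])
    denom = np.linalg.det([[x1 - x4, y1 - y4], [x3 - x2, y3 - y2]])
    xc = np.linalg.det([[a, x1 - x4], [b, x3 - x2]]) / denom
    yc = np.linalg.det([[a, y1 - y4], [b, y3 - y2]]) / denom
    return np.array([xc, yc])

def point_from_dsb(p_start, p_end, dsb):
    """Point at distance ratio DsB along the segment, as in Eqs. (4)-(11)."""
    return p_start + dsb * (p_end - p_start)

# Made-up corner points of a document image (C1C4 and C2C3 are the diagonals).
C1, C2, C3, C4 = map(np.array, [(0.0, 0.0), (999.0, 0.0), (0.0, 1399.0), (999.0, 1399.0)])
Dc = line_intersection(C1, C4, C2, C3)            # image center from the two diagonals

dsb_values = [0.15, 0.35, 0.65]                   # placeholder cross-ratio-derived ratios
left_points = [point_from_dsb(C1, C4, d) for d in dsb_values]   # left diagonal
right_points = [point_from_dsb(C2, C3, d) for d in dsb_values]  # right diagonal

# Toy additive embedding of +/-1 bits at the computed points (Eq. (18) in spirit).
image = np.zeros((1400, 1000))
alpha, bits = 3, [1, -1, 1]
for (xl, yl), w in zip(left_points, bits):
    image[int(yl), int(xl)] += alpha * w

print(Dc, left_points[0], right_points[0])
```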

C. Detection Scheme

To detect a watermark in the document image Ie', the four image corner points must first be detected. This can be achieved, for example, by using any of the existing corner detection algorithms. Once the four corner points are detected, the watermark embedding points must be identified. Each point


can be calculated by using a method similar to that of the embedding stage (see Section B for details). By extracting the values of the pixels corresponding to those watermark embedding points, denoted by Ie'(xk, yk), a watermark can be detected by using any of the existing watermark detectors. Here, we adopt the correlation coefficient detector [15]. The correlation coefficient value is computed by the following equation:

    Zcc(Ie', wk) = Σ(Ie_c · Wk_c) / sqrt( Σ(Ie_c · Ie_c) · Σ(Wk_c · Wk_c) )  (19)

where Ie_c = Ie' - mean(Ie') and Wk_c = Wk - mean(Wk) denote the mean-removed extracted pixel values and watermark pattern.

A watermark is detected if the correlation coefficient value is greater than a detection threshold. For example, in the experiments that follow, the detection threshold is 0.5.
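A minimal sketch of the correlation coefficient detector of Eq. (19), with made-up pixel values and watermark bits; it simply checks whether the normalized correlation between the extracted pixel values and a candidate pattern exceeds the 0.5 threshold used in the experiments.

```python
import numpy as np

def correlation_detector(extracted, pattern, threshold=0.5):
    """Eq. (19): normalized correlation between mean-removed extracted pixel
    values and a mean-removed watermark pattern; returns (score, detected)."""
    e = np.asarray(extracted, dtype=float)
    w = np.asarray(pattern, dtype=float)
    e_c, w_c = e - e.mean(), w - w.mean()
    score = np.sum(e_c * w_c) / np.sqrt(np.sum(e_c * e_c) * np.sum(w_c * w_c))
    return score, score > threshold

# Made-up example: a flat background plus an additively embedded +/-1 pattern.
rng = np.random.default_rng(0)
pattern = rng.choice([-1, 1], size=100)
background = np.full(100, 128.0) + rng.normal(0, 1, 100)
watermarked = background + 3 * pattern          # strength comparable to alpha = 3 below

print(correlation_detector(watermarked, pattern))   # high score -> detected
print(correlation_detector(background, pattern))    # near zero -> not detected
```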

IV. EXPERIMENT

In this computer simulation experiment, 35 grayscale multi-language document images of size 1240x1754 pixels were used, and 20 different invisible actual watermark patterns of length 100 bits were added, with strength α = 3 and a watermark block size of 5x5 pixels per watermark pattern bit. 120 cross-ratio values were used for watermark embedding and detection. The results of digital embedding in the 35 grayscale document images, comprising images with text in English, Thai, Chinese, and Arabic and 20 different watermark patterns (see Fig. 4), measured by the correlation coefficient with the threshold fixed at 0.5 (a document image is considered watermarked if the value is 0.5 or higher, and unwatermarked if the value is below 0.5), revealed a reasonable watermark robustness enhancement from applying the cross-ratio.

Figure 4. Some examples of the 20 random invisible actual watermark pattern bits created and embedded between the text lines of English, Thai, Chinese, and Arabic text document images.

First, testing the control document image without a watermark against the watermark pattern gave a correlation coefficient of 0, while the image with the watermark pattern gave a correlation coefficient of 1.

After that, the robustness of the cross-ratio watermarking was tested with 9 attacks, including 3 geometric distortions (shearing, scaling and rotating) and 6 manipulations (compression, sharpness, brightness, contrast, blur masking and noise addition). The effects on detection of the actual watermarks can be classified into three groups as follows.

Group I: No effect on actual watermark robustness was found under the sharpness attack, which gave a correlation coefficient of 1 for all percentages of sharpness filtering in the range 0 - 100%.

Group II: Very low effect on actual watermark robustness was found under compression, in the range 60 - 100% of JPEG compression quality (see Fig. 5); scaling, in the range 11 - 60% of scaling factor (see Fig. 6); blur, in the range 3x3 - 13x13 of blur filtering mask size (see Fig. 7); contrast, in the range 1 - 45%; shearing, in the range 0 - 0.05; and rotating, at angles between 1 and 4 degrees (see Fig. 8). These showed acceptable correlation coefficient values between 0.5 and 1 for all of the attack values specified above.

Group III: High effect on actual watermark robustness was found under brightness adjustment higher than 5%, salt-and-pepper noise higher than 1.5%, and Gaussian noise at all levels, which showed unacceptable correlation coefficient values, mostly near 0. This shows that noise signals are the most complicated factor affecting watermark detection: the more the signal is disturbed, the more difficult watermark detection becomes.

Figure 5. Correlation coefficient of watermarked multi-language document images, showing that all invisible actual watermarks are still reasonably detected after reducing the JPEG compression quality down to the 60% level.

Figure 6. Correlation coefficient of watermarked multi-language document images under scaling, showing robustness for scaling factors between 11% and 120%.

Figure 7. Correlation coefficient of watermarked multi-language document images after blur filtering with mask sizes between 3x3 and 15x15, showing that robustness is kept up to a blur filtering mask size of 13x13.

Figure 8. Correlation coefficient values of watermarked multi-language document images with the rotation angle varied from 1 to 4 degrees, still showing robustness.

V. CONCLUSIONS

The correlation coefficient measurement, with acceptable values between 0.5 and 1, used for detecting the invisible grayscale watermark in the multi-language document image files, has shown that the cross-ratio theory can be used effectively to build up reasonable watermarking robustness against the geometric distortion attacks of scaling (especially above 11%), shearing (0 - 0.05) and rotating (1 - 4 degrees), and against some manipulation attacks: compression (at quality above 60%), contrast (1 - 45%), sharpness (0 - 100%) and blur filtering with mask sizes no greater than 13x13. This built-up robustness is based on four collinear points, which are used both as the watermark embedding pattern references and as the reference points for watermark detection. The image does not need to be inversely transformed before detecting the watermark positions; the watermark positions can be detected directly, and no comparison with the original unwatermarked document image is needed. Ownership of a document can be proved directly through comparison against the existing watermark pattern.

The experiments have also shown that the method can be applied to document images in any language, not depending on specific language attributes, unlike the methods mentioned above, which mostly focus on a single language and do not thoroughly explore the possible attacks affecting watermark robustness. This is the first step in applying the cross-ratio theory to grayscale multi-language document image watermarking. As a next step we hope to improve it to build up significantly higher robustness, especially resistance to the noise addition, rotation and brightness attacks.

REFERENCES

[1] J. T. Brassil, et al., "Electronic Marking and Identification Techniques to Discourage Document Copying," IEEE Journal on Selected Areas in Communications, Vol. 13, No. 8, Oct 1995, pp. 1495-1504.
[2] S. H. Low, N. F. Maxemchuk, J. T. Brassil, and L. O'Gorman, "Document marking and identification using both line and word shifting," Proceedings of the Fourteenth Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM'95), vol. 2, 1995, pp. 853-860.
[3] Y. Kim, K. Moon, and I. Oh, "A Text Watermarking Algorithm based on Word Classification and Inter-word Space Statistics," Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR'03), 2003, pp. 775-779.
[4] A. M. Alattar and O. M. Alattar, "Watermarking electronic text documents containing justified paragraphs and irregular line spacing," Proceedings of SPIE Volume 5306, Security, Steganography, and Watermarking of Multimedia Contents VI, 2004, pp. 685-695.
[5] D. Huang and H. Yan, "Interword Distance Changes Represented by Sine Waves for Watermarking Text Images," IEEE Trans. Circuits and Systems for Video Technology, Vol. 11, No. 12, pp. 1237-1245, 2001.
[6] Du Min and Zhao Quanyou, "Text Watermarking Algorithm based on Human Visual Redundancy," Advances in Information Sciences and Service Sciences (AISS), Vol. 3, No. 5, pp. 229-235, 2011.
[7] W. Zhang, Z. Zeng, G. Pu, and H. Zhu, "Chinese Text Watermarking Based on Occlusive Components," IEEE, pp. 1850-1854, 2006.
[8] M. H. Shirali-Shahreza and M. Shirali-Shahreza, "A New Approach to Persian/Arabic Text Steganography," IEEE International Conference on Computer and Information Science, 2006.
[9] Ranganathan Suganya, Johnsha Ahamed, Ali, Kathirvel K. and Kumar Mohan, "Combined Text Watermarking," International Journal of Computer Science and Information Technologies, Vol. 1 (5), pp. 414-416, 2010.
[10] U. Topkara, M. Topkara, and M. J. Atallah, "The Hiding Virtues of Ambiguity: Quantifiably Resilient Watermarking of Natural Language Text through Synonym Substitutions," In Proc. of ACM Multimedia and Security Conference, 2006.
[11] Natthawut Samphaiboon and Matthew N. Dailey, "Steganography in Thai text," In Proc. of 5th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON 2008), pp. 133-136, 2008.
[12] H. S. M. Coxeter and S. L. Greitzer, "Collinearity and Concurrence," Geometry Revisited, Ch. 3, Math. Assoc. Amer., 1967, pp. 51-79.
[13] R. Mohr and L. Morin, "Relative Positioning from Geometric Invariants," Proceedings of the Conference on Computer Vision and Pattern Recognition, 1991, pp. 139-144.
[14] F. Antonio, "Faster Line Segment Intersection," Graphics Gems III, Ch. IV.6, Academic Press, 1999, pp. 199-202 and 500-501.
[15] J. Cox, M. L. Miller, and J. A. Bloom, Digital Watermarking, Morgan Kaufmann Publishers, 2002.


PCA Based Handwritten Character Recognition System Using Support Vector Machine & Neural Network
Ravi Sheth1, N C Chauhan2, Mahesh M Goyani3, Kinjal A Mehta4
1 Information Technology Dept., A.D. Patel Institute of Technology, New V V Nagar-388120, Gujarat, India
2 Information Technology Dept., A.D. Patel Institute of Technology, New V V Nagar-388121, Gujarat, India
3 Computer Engineering Dept., L.D. College of Engineering, Ahmedabad, Gujarat, India
4 Electronics and Communication Dept., L.D. College of Engineering, Ahmedabad, Gujarat, India
1 raviesheth@gmail.com
Abstract: Pattern recognition deals with the categorization of input data into one of the given classes based on the extraction of features. Handwritten Character Recognition (HCR) is one of the well-known applications of pattern recognition. For any recognition system, an important part is feature extraction; a proper feature extraction method can increase the recognition ratio. In this paper, a Principal Component Analysis (PCA) based feature extraction method is investigated for developing an HCR system. PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension. These features of the character image are then used for training and testing with Neural Network (NN) and Support Vector Machine (SVM) classifiers. HCR is also implemented with PCA and Euclidean distance.

Keywords: Pattern recognition, handwritten character recognition, feature extraction, principal component analysis, neural network, support vector machine, Euclidean distance.

I. INTRODUCTION

Handwritten character recognition is an area of pattern recognition that has become the subject of research during the last few decades. Handwriting recognition has always been a challenging task in pattern recognition, and many systems and classification algorithms have been proposed over the past years. Techniques ranging from statistical methods such as PCA and Fisher discriminant analysis [1] to machine learning methods such as neural networks [2] or support vector machines [3] have been applied to solve this problem. The aim of this paper is to recognize handwritten English characters by using PCA with the three different methods mentioned above. Handwritten characters have an infinite variety of styles, varying from person to person. Due to this wide range of variability, it is very difficult for a machine to recognize a handwritten character, and the ultimate target is still out of reach. There is huge scope for development in the field of handwritten character recognition, and any future progress in this field will be able to improve the communication between machines and humans. Generally, HCR is divided into four major parts, as shown in Fig. 1 [4]: binarization, segmentation, feature extraction and classification. A major problem faced when dealing with segmented handwritten character recognition is the ambiguity and illegibility of the characters. Accurate recognition of segmented characters is important for the recognition of words based on segmentation [5]. Feature extraction is the most difficult part of an HCR system.

Figure 1: Block diagram of the HCR system (input, binarization, segmentation, feature extraction, classification, output).

Before recognition, however, the handwritten characters have to be processed to make them suitable for recognition. Here, we consider the processing of an entire document containing multiple lines and many characters in each line; our aim is to recognize characters from the entire document. The handwritten document has to be free from noise, skewness, etc. The lines and words have to be segmented, and the characters of any word have to be free from any slant angle so that the characters can be separated for recognition. By this assumption, we avoid the more difficult case of cursive writing. Segmentation of unconstrained handwritten text lines is difficult because of inter-line distance variability, baseline skew variability, different font sizes and the age of the document [5]. In the next step of this process, features are extracted from the segmented character. Feature extraction is a very important part of the character recognition process. The extracted features are applied to classifiers which recognize the character based on the trained features. In the second section, we describe the


feature extraction method in brief, in particular the principal component analysis method. In the following section we discuss the neural network, SVM and Euclidean distance methodologies.

II. FEATURE EXTRACTION

Any given image can be decomposed into several features. The term feature refers to similar characteristics. Therefore, the main objective of a feature extraction technique is to accurately retrieve these features. The term feature extraction can thus be taken to include a very broad range of techniques and processes for the generation, update and maintenance of discrete feature objects or images [6]. Feature extraction is the most difficult part of an HCR system. This approach gives the recognizer more control over the properties used in identification. The character classification task recognizes a character by comparing it with the standard values that come out of the learned characters, and the character should correspond to the document image matching the document style set in the document style setting part. Here we have investigated and developed a PCA based feature extraction method.

Principal component analysis

PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension [7]. It is a way of identifying patterns in data and expressing the data in such a way as to highlight their similarities and differences. Since patterns can be hard to find in data of high dimension, where graphical representation is not available, PCA is a powerful tool for analyzing data [7]. The other main advantage of PCA is that once you have found these patterns in the data, you can compress the data, i.e. reduce the number of dimensions, without much loss of information [7]. PCA is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has as high a variance as possible (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (uncorrelated with) the preceding components. Principal components are guaranteed to be independent only if the data set is jointly normally distributed. Before presenting the methodology, it is important to discuss the following terms related to PCA [7].

Eigenvectors and eigenvalues

The eigenvectors of a square matrix are the nonzero vectors that, after being multiplied by the matrix, remain proportional to the original vector (i.e., they change only in magnitude, not in direction). For each eigenvector, the corresponding eigenvalue is the factor by which the eigenvector is scaled when multiplied by the matrix. Another property of eigenvectors is that even if we scale the vector by some amount before we multiply it, we still get the same multiple of it as a result. It is also worth knowing that when mathematicians find eigenvectors, they like to find the eigenvectors whose length is exactly one. This is because the length of a vector does not affect whether it is an eigenvector, whereas the direction does; so, in order to keep eigenvectors standard, whenever we find an eigenvector we usually scale it to have a length of 1, so that all eigenvectors have the same length [7].

Steps for generating principal components of character and digit images:

Step 1: Get some data and find the mean. In this work we used our own data set: handwritten characters A-J and digits 1-5, with 30 samples of each character or digit. We find the mean using equation (1):

    M = (1/N) * sum_{k=1..N} X_k                                            (1)

where M = mean, N = total number of input images, and X_k = the k-th input image.

Step 2: Subtract the mean. For PCA to work properly, we subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension (equation (2)), where M is the mean calculated using equation (1). So, all the X values have the mean of the X values of all the data points subtracted, and all the Y values have the mean of the Y values subtracted from them. This produces a data set whose mean is zero:

    X_n - M                                                                 (2)

Step 3: Calculate the covariance matrix. The next step is to find the covariance matrix using equation (3):

    C = (1/N) * sum_{k=1..N} (X_k - M)(X_k - M)^T                           (3)

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix. Since the covariance matrix is square, we can calculate its eigenvectors and eigenvalues. By taking the eigenvectors of the covariance matrix, we are able to extract the lines that characterize the data. The rest of the steps involve transforming the data so that it is expressed in terms of these lines.

Step 5: Choose components and form a feature vector. Here is where the notion of data compression and reduced dimensionality comes in. In general, once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance. What needs to be done now is to form a feature vector, which is just a name for a matrix of vectors: it is constructed by taking the eigenvectors that we want to keep from the list of eigenvectors and forming a matrix with these eigenvectors in the columns.

Step 6: Derive the new data set. This final step of PCA is also the easiest. Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of this vector and multiply it on the left of the original data set, transposed.
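The six steps above can be summarized in a short sketch. This is an illustrative NumPy version under the assumption that each character image is flattened into a vector; the array shapes and the number of retained components are made-up, and this is not the authors' Matlab implementation.

```python
import numpy as np

def pca_features(images, n_components=25):
    """Steps 1-6: mean, mean subtraction, covariance, eigen-decomposition,
    component selection, and projection of the data onto the components.
    'images' is an (N, d) array, one flattened character image per row."""
    X = np.asarray(images, dtype=float)
    mean = X.mean(axis=0)                      # Step 1: mean image
    centered = X - mean                        # Step 2: subtract the mean
    cov = np.cov(centered, rowvar=False)       # Step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # Step 4: eigenvalues / eigenvectors
    order = np.argsort(eigvals)[::-1]          # Step 5: sort by eigenvalue, keep the top ones
    feature_vector = eigvecs[:, order[:n_components]]
    projected = centered @ feature_vector      # Step 6: derive the new data set
    return projected, feature_vector, mean

# Made-up data: 450 samples of 20x20 "images" flattened to 400-dimensional vectors.
rng = np.random.default_rng(1)
samples = rng.random((450, 400))
features, components, mean_image = pca_features(samples)
print(features.shape)    # (450, 25)
```

The `projected` matrix plays the role of the feature matrix (PC_A for training data, PC_B for test data) used by the classifiers in the next section.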


III. CLASSIFICATION METHODS

A. Neural Network

Artificial neural networks (ANN) provide a powerful simulation of information processing and are widely used in pattern recognition applications. The most commonly used neural network is the multilayer feed-forward network, which maps an input layer of nodes onto an output layer through a number of hidden layers. In such networks, a back-propagation algorithm is usually used as the training algorithm for adjusting the weights [9]. The back-propagation model, or multi-layer perceptron, is a neural network that uses a supervised learning technique. Typically there are one or more layers of hidden nodes between the input and output nodes. A single network can be trained to reproduce all the visual parameters, or many networks can be trained so that each network estimates a single visual parameter. Many parameters, such as the training data, transfer function, topology, learning algorithm, weights and others, can be controlled in the neural network [9].

Figure 2: Neural network design.

B. Support Vector Machine

The main purpose of any machine learning technique is to achieve the best generalization performance, given a specific amount of time and a finite amount of training data, by striking a balance between the goodness of fit attained on a given training dataset and the ability of the machine to achieve error-free recognition on other datasets [10].

With this concept as the basis, support vector machines have proved to achieve good generalization performance with no prior knowledge of the data. The main goal of an SVM [10] is to map the input data onto a higher dimensional feature space nonlinearly related to the input space and determine a separating hyper plane with maximum margin between the two classes in the feature space.

Figure 3: SVM margin and support vectors [10]

The main task of the SVM is to find this hyperplane using support vectors (the essential training tuples) and margins (defined by the support vectors). Let the data D be (Z1, y1), ..., (Z|D|, y|D|), where Zi is the set of training tuples associated with the class labels yi, which take either the value +1 or -1 [11]. There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes the classification error on unseen data). The SVM searches for the hyperplane with the largest margin, i.e., the Maximum Marginal Hyperplane (MMH) [11]. The basic concept of the SVM can be summarized as follows. A separating hyperplane can be written as [11]

    X · Z + C = 0                                                           (4)

where X = {X1, X2, ..., Xn} is a weight vector and C a scalar (bias). In 2-D it can be written as [11]

    X0 + X1 Z1 + X2 Z2 = 0,

where X0 = C is an additional weight.


The hyperplanes defining the sides of the margin are:

    H1: X0 + X1 Z1 + X2 Z2 >= 1,  for Yi = +1, and
    H2: X0 + X1 Z1 + X2 Z2 <= -1, for Yi = -1.

Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors [11]. If the data were 3-D (i.e., with three attributes), we would have to find the best separating plane.
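To make the margin picture concrete, here is an illustrative sketch (not the paper's implementation, which used the libsvm package [12]) that fits a linear SVM on made-up 2-D points and shows that the support vectors lie on the hyperplanes H1 and H2, i.e. their decision values are close to +1 or -1.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D training tuples with labels +1 / -1.
rng = np.random.default_rng(2)
pos = rng.normal(loc=[2.0, 2.0], scale=0.4, size=(20, 2))
neg = rng.normal(loc=[-2.0, -2.0], scale=0.4, size=(20, 2))
Z = np.vstack([pos, neg])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=10.0).fit(Z, y)

# Decision values of the support vectors: they sit on H1 (about +1) or H2 (about -1).
sv_values = clf.decision_function(clf.support_vectors_)
print("number of support vectors:", len(clf.support_vectors_))
print("decision values:", np.round(sv_values, 3))
```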

A. Implementation Results of ANN & PCA based character recognition Prepared PC_A matrix is given as an input to the neural network for training purpose. Similarly PC_B matrix is given to this trained network for testing purpose. The overall accuracy of 85% was obtained for the test data using ANN. B. Implementation Results of SVM & PCA based character recognition Similarly as we have described above, PC_A matrix is given as an input to the SVM for training purpose. Similarly PC_B matrix is given to this trained network for testing purpose. We have used libsvm package [12] for the classification purpose. The overall accuracy of 92% was obtained for the test data using SVM. C. Implementation Results of Euclidean distance & PCA based character recognition In this method for recognition purpose we have found the Euclidean distance between PC_A and PC_B and found the minimum index and based on this index we have found which character is recognized. PC_A and PC_B prepared using steps that we have discussed in previous section. We have measured over all accuracy of this method is 90%. D. Comparison of Recognition using ANN, SVM classifiers and Euclidean distance. In table 1 we have listed different methods and accuracy. As shown in table we can easily say that overall accuracy of PCA (SVM) is good compare to PCA (NN) and PCA (Euclidean distance) method. If we compare these methods on basis of training time then also SVM methods required less time compare to neural network and Euclidean distance. But drawback of SVM methods is we have to generate SVM format training and testing files, while in case of other methods its not required. Now if we compare individual character accuracy then also PCA (SVM) gives good result compare to other method. Table 1: Comparison of Overall Accuracy Method Structure/Parameter Accuracy PCA(Neural [25 30 6 25] 85% Network) PCA (SVM) Kernel-RBF 92% (Redial Bias Function) Cost-1 Gamma-1 PCA 90% (Euclidean distance )

Once the support vector machine has been trained, we use it to classify test (new) tuples. Based on the Lagrangian formulation [11], the MMH can be rewritten as the decision boundary

D(Z^T) = Σ_{i=1}^{L} yi αi (Zi · Z^T) + c        (5)

where yi is the class label of support vector Zi, Z^T is a test tuple, αi is a Lagrangian multiplier, and L is the number of support vectors.

C. Euclidean Distance
The Euclidean distance is the most popular technique for measuring the distance between two matrices or images. Let X and Y be two n×m images, X = (X1, X2, ..., X_{m·n}) and Y = (Y1, Y2, ..., Y_{m·n}). The Euclidean distance between X and Y is given by

d^2(X, Y) = Σ_{k=1}^{m·n} (Xk − Yk)^2        (6)
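A minimal sketch of equation (6) used for recognition, under assumed data layouts (PC_A holding one PCA feature vector per known class, PC_B holding the feature vectors of segmented test characters); the matrices and labels below are hypothetical.

import numpy as np

def recognize(PC_A, PC_B, labels):
    results = []
    for feat in PC_B:                               # each test character
        d2 = np.sum((PC_A - feat) ** 2, axis=1)     # squared Euclidean distances of (6)
        results.append(labels[int(np.argmin(d2))])  # minimum-distance index decides
    return results

labels = list("ABCDEFGHIJ") + ["1", "2", "3", "4", "5"]   # the 15 classes of the paper
PC_A = np.eye(15)                    # hypothetical class prototypes (15 feature vectors)
PC_B = np.eye(15)[[2, 7]] + 0.05     # two test characters close to classes C and H
print(recognize(PC_A, PC_B, labels)) # -> ['C', 'H']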

IV. EXPERIMENT AND RESULTS
In this work the PCA method discussed in Section II was implemented in the Matlab environment. The extracted data are used as features for two classifiers, namely a neural network and a support vector machine. We prepared a real-world dataset comprising the characters A to J and the digits 1 to 5. The dataset was prepared by collecting the handwriting of different persons in a specific format. We took 30 samples of each character and digit, so the dataset contains a total of 450 samples for the characters A to J and the digits 1 to 5. We applied the PCA method to this database and prepared the feature matrix PC_A. For testing, we took 30 different images. Binarization and segmentation are applied one by one to each input image, and the corresponding feature matrix PC_B is prepared for all segmented characters.
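The sketch below shows one plausible way to build PCA feature matrices such as PC_A (training) and PC_B (testing) and to train an RBF-kernel SVM with Cost = 1 and Gamma = 1 as in Table 1. It is not the authors' Matlab/libsvm code; the image size, the number of principal components and the random data are assumptions for illustration only.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# hypothetical arrays: 450 training and 30 test characters, flattened 32x32 images
train_imgs = np.random.rand(450, 1024)
test_imgs = np.random.rand(30, 1024)
train_labels = np.repeat(np.arange(15), 30)      # 15 classes, 30 samples each

pca = PCA(n_components=25)                        # assumed number of components
PC_A = pca.fit_transform(train_imgs)              # training feature matrix
PC_B = pca.transform(test_imgs)                   # test features projected with the same PCA

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(PC_A, train_labels)
print(clf.predict(PC_B[:5]))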


Table 2: Comparison of Individual Character Accuracy
Sr.no  Letter or Digit  Accuracy of PCA-SVM (%)  Accuracy of PCA-ANN (%)  Accuracy of PCA-Euclidean Distance (%)
1      A                96                       80                       98
2      B                99                       80                       98
3      C                99                       100                      96
4      D                95                       70                       96
5      E                96                       80                       95
6      F                97                       80                       95
7      G                96                       90                       95
8      H                95                       80                       98
9      I                98                       75                       96
10     J                97                       80                       96
11     1                97                       80                       95
12     2                96                       90                       95
13     3                95                       80                       95
14     4                99                       80                       98
15     5                96                       80                       95

V. CONCLUSION
A simple and efficient off-line handwritten character recognition system using PCA-based feature extraction is investigated. The selection of the feature extraction method is the most important factor for achieving a high recognition rate. In this work, we implemented a PCA-based feature extraction method. Using the obtained features, we trained a neural network as well as an SVM to recognize characters. We also implemented character recognition with PCA and the Euclidean distance. The three methods achieved overall recognition rates of 85% for the PCA-based neural network, 92% for the PCA-based SVM and 90% for PCA with the Euclidean distance.

REFERENCES
[1] S. Mori, C.Y. Suen and K. Yamamoto, "Historical review of OCR research and development," Proc. of the IEEE, vol. 80, pp. 1029-1058, July 1992.
[2] V.K. Govindan and A.P. Shivaprasad, "Character recognition - a review," Pattern Recognition, vol. 23, no. 7, pp. 671-683, 1990.
[3] H. Fujisawa, Y. Nakano and K. Kurino, "Segmentation methods for character recognition: from segmentation to document structure analysis," Proceedings of the IEEE, vol. 80, pp. 1079-1092, 1992.
[4] Ravi K. Sheth, N.C. Chauhan and Mahesh M. Goyani, "A handwritten character recognition system using correlation coefficient," V V P Rajkot, 8-9 April 2011, ISBN 978-81-906377-5-6, pp. 395-398.
[5] U. Pal and B.B. Chaudhuri, "Indian script character recognition: a survey," Pattern Recognition, vol. 37, no. 9, pp. 1887-1899, 2004.
[6] Ravi K. Sheth, N.C. Chauhan, M.G. Goyani and Kinjal A. Mehta, "Chain code based handwritten character recognition system using neural network and SVM," ICRTITCS-11, Mumbai, 9-10 December 2011.
[7] Lindsay I. Smith, "A tutorial on principal components analysis," February 26, 2002.
[7] Dewi Nasien, Habibollah Haron and Siti Sophiayati Yuhaniz, "The heuristic extraction algorithms for Freeman chain code of handwritten character," International Journal of Experimental Algorithms (IJEA), vol. 1, issue 1.
[8] S. Arora, "Features combined in a MLP-based system to recognize handwritten Devnagari character," Journal of Information Hiding and Multimedia Signal Processing, vol. 2, no. 1, January 2011.
[9] H. Izakian, S.A. Monadjemi, B. Tork Ladani and K. Zamanifar, "Multi-font Farsi/Arabic isolated character recognition using chain codes," World Academy of Science, Engineering and Technology, vol. 43, 2008.
[10] C.J.C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, 1998, pp. 121-167.
[11] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006, pp. 337-343.
[12] Chih-Jen Lin, "LIBSVM: a library for support vector machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm/


Web Mining Using Concept-based Pattern Taxonomy Model


Sheng-Tang Wu
Dept. of Applied Informatics and Multimedia Asia University, Taichung, Taiwan swu@asia.edu.tw

Yuefeng Li
Faculty of Science and Technology Queensland University of Technology Brisbane, Australia y2.li@qut.edu.au

Yung-Chang Lin
Dept. of Applied Informatics and Multimedia Asia University, Taichung, Taiwan 100165015@live.asia.edu.tw

Abstract In the last decade, most of the current Patternbased Knowledge Discovery systems use statistical analyses only (e.g. occurrence or frequency) in the phase of pattern discovery. The downside of these approaches is that two different patterns may have the same statistical feature, yet one pattern of them may, however, contribute more to the meaning of text than the other. Therefore, how to extract the concept patterns from the data and then apply these patterns to the Pattern Taxonomy Model becomes the main purpose of this project. In order to analyze the concept of documents, the Natural Language Processing (NLP) technique is used. Moreover, with the support from lexical Ontology (e.g. Propbank), a novel concept-based pattern structure called verb-argument is defined and equipped into the proposed Concept-based Patten Taxonomy Model (CPTM). Hence, by combining the techniques from several fields (including NLP, Data Mining, Information Retrieval, and Text Mining), this paper aims to develop an effective and efficient model CPTM to address the aforementioned problem. The proposed model is examined by conducting real Web mining tasks and the experimental results show that CPTM model outperforms other methods such as Rocchio, BM25 and SVM. Keywords- Concept Pattern; Pattern Taxonomy; Knowledge Discovery; Web Mining; Data Mining

articles. Unfortunately, regardless of the feature of frequency, there are no other features such as the relation between words being even mentioned. Natural Language Processing (NLP) is one of the sub-fields of Artificial Intelligence (AI). The main object of NLP is to transform human language or text into a form that the machine can deal with. Generally speaking, the process of analyzing human language or text is very complex for a machine. Firstly, the text is broken into partitions or segments, and then each word is tagged with labels according to its part of speech (POS). Finally, the appropriate representatives are generated using parser based on the analysis of the relationship between words to describe the semantic information. Therefore, the relationship between discovered patterns can then be evaluated instead of using the statistical features of words. The integration of NLP and pattern taxonomy model (PTM)[17] can be expected to be able to find more useful patterns and construct more effective concept-based Pattern Taxonomies. In order to extract and analyze the concept from documents, the statistical mechanism is insufficient in the information retrieval model during the phase of pattern discovering. One possible solution is to utilize the information provided by Ontology (such as WordNet, Treebank and Propbank[10]). Therefore, a novel Conceptbased Pattern Taxonomy Model (CPTM) with support from NLP is proposed in this study for the purpose of overcoming the pre-mentioned problems caused by the use of statistical method. The typical process of Pattern-based Knowledge Discovery (PKD) has two main steps. The first step is to find proper patterns, which can represent the concept or semantic, from training data using machine learning or data mining approaches. The second step is how to effectively use these patterns to meet the users needs. However, the relationship between patterns is ignored and not taken into account in the most cases while dealing with patterns. For example, although two words have exactly the same statistical properties, the contributions of each word are sometimes not equal.[15] Therefore, the main objective of this work is to extract and quantify the concept from documents using the proposed PTM-based method.

I. INTRODUCTION

Due to the rapid growth of digital data made available in the recent years, knowledge discovery and data mining have attracted great attention with an imminent need for turning such data into useful information and knowledge. Knowledge discovery[3, 5] can be viewed as the process of nontrivial extraction of information from large databases, information that is implicitly presented in the data, previously unknown and potentially useful for users. In the whole process of knowledge discovery, this study especially focuses on the phase between the transformed data and the discovered knowledge. As a result, the most important issue is how to mine useful patterns using data mining techniques, and then transform them into valuable rules or knowledge. The field of Web mining has drawn a lot of attention with the constant development of World Wide Web. Most of the Web content mining techniques try to use keywords as representatives to describe the concept of documents [4, 14]. In other words, the semantic of documents can be represented by a set of words frequently appeared in these


II. LITERATURE REVIEW

The World Wide Web provides rich information on an extremely large amount of linked Web pages. Such a repository contains not only text data but also multimedia objects, such as images, audio and video clips. Data mining on the World Wide Web can be referred to as Web mining which has gained much attention with the rapid growth in the amount of information available on the internet. Web mining is classified into several categories, including Web content mining, Web usage mining and Web structure mining[9]. Data mining is the process of pattern discovery in a dataset from which noise has been previously eliminated and which has been transformed in such a way to enable the pattern discovery process. Data mining techniques are developed to retrieve information or patterns to implement a wide range of knowledge discovery tasks. In recent years, several data mining methods are proposed, such as association rule mining[1], frequent itemset mining [21], sequential pattern mining [20], maximum pattern mining [6] and closed pattern mining [19]. Most of them attempt to develop efficient mining algorithms for the purpose of finding specific patterns within a reasonable period of time. However, how to effectively use this large amount of discovered patterns is still an unsolved issue. Therefore, the pattern taxonomy mechanism [16] is proposed to replace the keyword-based methods by using tree-like taxonomies as concept representatives. Taxonomy is a structure that contains information describing the relationship between sequence and sub-sequence [18]. In addition, the performance of PTM-based models is improved by adopting the closed sequential patterns. The removal of non-closed sequential patterns also results in the increase of efficiency of the system due to the shrunken dimensionality. III. CONCEPT-BASED PTM MODEL

In this sentence, we can first label the words based on their POS. The verbs, written in bold, then can be used as node in a specific structure to describe the semantic meaning of sentence. By expanding words from each verb, a structure called Verb-Argument [10] is formed, which is defined as a conceptual pattern in this study. The following conceptual patterns are obtained from the example sentence using the above definition: [ARG0 We] have [TARGET investigated] [ARG1 Data Mining filed, developed for many years, has encountered the issues of low frequency and high dimensionality] [ARG1 Data Mining filed] [TARGET developed] [ARGM-TMP for many years] has encountered the issues of low frequency and high dimensionality [ARG1 Data Mining filed developed for many years] has [TARGET encountered] [ARG2 the issues of low frequency and high dimensionality] TARGET denotes the verb in the sentence. ARG0, ARG1 and ARGM-TMP are arguments appeared around TARGET. Therefore, a set of "Verb-Argument" can be discovered while applying it to a whole document. After the above process, our proposed CPTM can then analyze these conceptual patterns in the next phase. From the data mining point of view, the conceptual patterns are defined as two types: sequential pattern and nonsequential pattern. The definition is described as follows: Firstly, let T = {t1, t2, ..., tk} be a set of terms, which can be viewed as words or keywords in a dataset. A non-sequential pattern is then a non-ordered list of terms, which is a subset of T, denoted as {s1, s2, ..., sm} (si T). A sequential pattern, defined as S = s1, s2,...,sn (siT), is an ordered list of terms. Note that the duplication of terms is allowed in a sequence. This is different from the usual definition where a pattern consists of distinct terms. After mining conceptual patterns, the relationship between patterns has to be defined in order to establish the pattern taxonomies. Sub-sequence is defined as follows: if there exist integers 1 i1 i2 in m, such that a1 = bi1, a2 = bi2,..., an = bin, then a sequence = a1, a2,...,an is a subsequence of another sequence = b1, b2,...,bm. For example, sequence s1, s3 is a sub-sequence of sequence s1, s2, s3. However, sequence s3, s1 is not a sub-sequence of s1, s2, s3 since the order of terms is considered. In addition, we can also say sequence s1, s2, s3 is a super-sequence of s1, s3. The problem of mining sequential patterns is to find a complete set of sub-sequences from a set of sequences whose support is greater than a user-predefined threshold (minimum support). We can then acquire a set of frequent sequential conceptpatterns CP for all documents d D+, such that CP = {p1, p2,, pn}. The absolute support suppa(pi) for all pi CP is obtained as well. We firstly normalize the absolute support of each discovered pattern based on the following equation:

Concept-based PTM (CPTM) model is developed using a sentence-based framework proposed to address the text classification problems. CPTM adopts the NLP techniques by parsing and tagging each word based on its POS and generating semantic patterns as a result [15]. Different from the traditional approaches, CPTM treats each sentence as a unit rather than entire article during the phase of semantic analysis. In addition, the weight of terms (words) or phrases is estimated according to their statistical characteristics (such as the number of occurrences) in the traditional methods. However, words may have different descriptive capabilities even though they own exactly the same statistic value. Therefore, the more effective conceptual patterns that are obtained, more precisely the system can determine the concept. How can we get more useful conceptual patterns by using NLP techniques? Below is our strategy to be described. An example sentence is stated as follows: We have investigated that the Data Mining field, developed for many years, has encountered the issues of low frequency and high dimensionality.
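The toy sketch below is a deliberately simplified stand-in for the Propbank-style analysis described here: it marks an assumed set of TARGET verbs in the example sentence and cuts crude argument windows around them. A real CPTM pipeline would use a POS tagger and a semantic role labeller instead of the hard-coded verb list.

SENTENCE = ("We have investigated that the Data Mining field, developed for many years, "
            "has encountered the issues of low frequency and high dimensionality.")
VERBS = {"investigated", "developed", "encountered"}   # assumed TARGET verbs

tokens = SENTENCE.replace(",", "").replace(".", "").split()
targets = [i for i, t in enumerate(tokens) if t in VERBS]

for i in targets:
    left = " ".join(tokens[max(0, i - 4):i])     # crude argument window before the verb
    right = " ".join(tokens[i + 1:i + 6])        # crude argument window after the verb
    print("[ARG? %s] [TARGET %s] [ARG? %s]" % (left, tokens[i], right))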


support: CP → [0, 1]        (1)

such that

support(pi) = supp_a(pi) / Σ_{pj ∈ CP} supp_a(pj)        (2)
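A minimal sketch (not the authors' code) of the sub-sequence relation used for building pattern taxonomies and of the support normalization of equations (1) and (2); the pattern and support values are hypothetical.

def is_subsequence(alpha, beta):
    """True if sequence alpha is a sub-sequence of beta (order preserved)."""
    it = iter(beta)
    return all(term in it for term in alpha)

def normalize_supports(abs_support):
    """Map the absolute supports of the patterns in CP to [0, 1] so they sum to 1."""
    total = float(sum(abs_support.values()))
    return {p: s / total for p, s in abs_support.items()}

print(is_subsequence(("s1", "s3"), ("s1", "s2", "s3")))   # True
print(is_subsequence(("s3", "s1"), ("s1", "s2", "s3")))   # False, order matters
print(normalize_supports({("s1", "s2"): 3, ("s2", "s3"): 1}))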

As aforementioned, statistical properties (such as support and confidence) are usually adopted to evaluate the patterns while using data mining techniques to mine frequent patterns. However, these properties are not effective in the stage of pattern deployment and evolution[17]. The reason is the short patterns will be always the major factors affecting the performance due to their high frequency. Therefore, what we need is trying to adopt long patterns which provide more descriptive information. Another effective way is to construct a new pattern structure to gather relative information by using above-mentioned NLP techniques. Figure 1 shows the flowchart of proposed CPTM model.

Figure 2. Two types of Pattern Evolving.

As CPTM model is established, we apply it to the Web mining task using real Web dataset for performance evaluation. Several standard benchmark datasets are available for experimental purposes, including Reuters Corpora, OHSUMED and 20 Newsgroups Collection. The dataset used in our experiment in this study is the Reuters Corpus Volume 1 (RCV1) [13]. An RCV1 example document is illustrated in Figure 3.

Figure 3. An example RCV1 document.

Figure 1. The flow chart of the CPTM Web mining model.

The pattern evolution shown in Figure 1 is used to map the pattern taxonomies into a feature space for the purpose of solving the low-frequency problem of long patterns. Two approaches are proposed in order to achieve this goal: Independent Pattern Evolving (IPE) and Deployed Pattern Evolving (DPE). IPE and DPE provide different representation manners for pattern evolving, as shown in Figure 2. IPE deals with patterns at the early state, in individual form, instead of manipulating patterns in deployed form at the late state. DPE is constructed by compounding discovered patterns from PTM into a hypothesis space, which means this action involves all the features, including some that may come from other patterns at the P level. Therefore, both methods can be used for pattern evolution and evaluation in the CPTM model.

RCV1 includes 806,791 English language news stories which were produced by Reuters journalists for the period between 20 August 1996 and 19 August 1997. These documents were formatted using a structured XML scheme. Each document is identified by a unique item ID and corresponded with a title in the field marked by the tag <title>. The main content of the story is in a distinct <text> field consisting of one or several paragraphs. Each paragraph is enclosed by the XML tag <p>. In our experiment, both the title and text fields are used and each paragraph in the text field is viewed as a transaction in a document.


Figure 4 indicates the primary result of pattern analysis using the Propbank scheme. The marked terms in parentheses are the verbs defined by Propbank. All of the conceptual patterns can then be generated on the verb-argument frame basis. At the next stage, the IPE and DPE methods are used for pattern evolving. Figure 5 illustrates the output of pattern discovery using CPTM, for example:

Sentence no. 1: [polic] [search] properti [own] marc dutroux chief [suspect] belgium child sex [abus] [murder] [scandal] tuesdai [found] decompos bodi two adolesc adult medic sourc
Sentence no. 2: [found] two bodi [advanc] [state] decomposit sourc told [condit] anonym
...
Sentence no. 7: fate two girl [remain] mysteri
Sentence no. 8: belgian girl gone [miss] recent year

Figure 4. The primary result of pattern analysis.

In addition, the effect of patterns derived from negative examples cannot be ignored, due to their useful information [11]. There is no doubt that negative documents contain much useful information for identifying ambiguous patterns during concept learning. Therefore, it is necessary for a CPTM system to exploit these ambiguous patterns from the negative examples in order to reduce their influence. Algorithm NDP is shown below.

Algorithm NDP(Δ, D+, D-)
Input: a list of deployed patterns Δ; a list of positive documents D+ and negative documents D-.
Output: a set of term-weight pairs d.
Method:
1:  d ← ∅
2:  θ = Threshold(D+)
3:  foreach negative document nd in D-
4:    if Threshold({nd}) > θ
5:      Δp = {dp in Δ | termset(dp) ∩ nd ≠ ∅}
6:      Weight shuffling for each p in Δp
7:    end if
8:    foreach deployed pattern dp in Δ
9:      d ← d ⊕ dp  (pattern merging)
10:   end for
11: end for

IV. EXPERIMENTAL RESULTS

The effectiveness of the proposed CPTM Web mining model is evaluated by performing an information filtering task with the real Web dataset RCV1. The experimental results of CPTM are compared to those of other baselines, such as TF-IDF, Rocchio, BM25 [12] and support vector machines (SVM) [2, 7, 8], using several standard measures. These measures include Precision, Recall, Top-k (k = 20 in this study), Breakeven Point (b/e), F-measure, Interpolated Average Precision (IAP) and Mean Average Precision (MAP).

Table 1. Contingency table.

Figure 5. The output of pattern discovery.

The precision is the fraction of retrieved documents that are relevant to the topic, and the recall is the fraction of relevant documents that have been retrieved. For a binary classification problem the judgment can be defined within a contingency table as depicted in Table 1. According to the definitions in this table, the measures Precision and Recall are given by TP/(TP+FP) and TP/(TP+FN) respectively, where TP (True Positives) is the number of documents the system correctly identifies as positive, FP (False Positives) is the number of documents the system falsely identifies as positive, and FN (False Negatives) is the number of relevant documents the system fails to identify. The precision of the top-K returned documents refers to the fraction of relevant documents among the first K returned documents. The value of K used in the experiments is 20, denoted as "t20". The breakeven point (b/e) provides another measurement for performance evaluation; it indicates the point where the value of precision equals the value of recall for a topic. Both the b/e and the F1-measure are single-valued measures in that they use only one figure to reflect the performance over all the documents. However, we need more figures to evaluate the system as a whole.
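A small sketch of the measures just defined, computed from hypothetical contingency counts (TP, FP, FN) and from a ranked result list for top-20 precision; it is meant only to make the definitions concrete.

def precision(tp, fp):
    return tp / float(tp + fp)

def recall(tp, fn):
    return tp / float(tp + fn)

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def precision_at_k(ranked_relevance, k=20):
    # ranked_relevance: booleans for the returned documents, best-ranked first
    top = ranked_relevance[:k]
    return sum(top) / float(len(top))

print(precision(30, 10), recall(30, 20), f1(30, 10, 20))
print(precision_at_k([True, False, True, True] * 10, k=20))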


Therefore, another measure, Interpolated Average Precision (IAP), is introduced. This measure is used to compare the performance of different systems by averaging precisions at 11 standard recall levels (i.e. recall = 0.0, 0.1, ..., 1.0). The 11-points measure used in our comparison tables indicates the first value of the 11 points, where recall equals zero. Moreover, Mean Average Precision (MAP) is used in our evaluation; it is calculated by measuring the precision at each relevant document first and then averaging these precisions over all topics. The decision function of the SVM is defined as

h(x) = sign(w · x + b) = +1 if w · x + b > 0, and −1 otherwise        (3)
Figure 6. The comparing results shown in several standard measures.

where x is the input vector, b ∈ R is a threshold, and w = Σ_{i=1}^{l} yi αi xi for the given training data

(x1, y1), ..., (xl, yl)        (4)

where xi ∈ R^n and yi = +1 (−1) if document xi is labeled positive (negative). αi ∈ R is the weight of the training example xi and satisfies the following constraints:

∀i: αi ≥ 0 and Σ_{i=1}^{l} αi yi = 0        (5)

Since all positive documents are treated equally before the process of document evaluation, the value of αi is set to 1.0 for all positive documents, and thus the αi for the negative documents can be determined using equation (5).
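A toy sketch of the decision function of equations (3)-(5): the weight vector w is built from the dual weights as w = Σ αi yi xi, the positive αi are set to 1.0, and the negative αi are scaled so that constraint (5) holds. All numbers are hypothetical.

import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])  # training vectors xi
y = np.array([+1, +1, -1, -1])                                      # labels yi
alpha = np.ones(4)                                                  # positive alphas = 1.0

neg = y == -1
alpha[neg] *= alpha[~neg].sum() / alpha[neg].sum()   # enforce sum_i alpha_i * y_i = 0

w = (alpha * y) @ X        # w = sum_i alpha_i y_i x_i
b = 0.0                    # threshold; a real system would estimate it from the data

def h(x):
    return 1 if np.dot(w, x) + b > 0 else -1

print(h(np.array([1.5, 1.0])), h(np.array([-1.5, -1.0])))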

V. CONCLUSION

In general, a significant number of patterns can be retrieved by using data mining techniques to extract information from Web data. However, how to effectively use these discovered patterns is still an unsolved problem. Another typical issue is that only statistical properties (such as support and confidence) are used when evaluating the effectiveness of patterns; the useful information hidden in the relationships between patterns is still not utilized. The drawback of traditional methods is that longer patterns usually have a lower support, resulting in low performance. Therefore, NLP techniques can be adopted to help define and generate the conceptual patterns. In this paper, a novel concept-based PTM Web mining model, CPTM, is proposed. CPTM provides effective solutions to the aforementioned problems by integrating NLP techniques and a lexical ontology. The experimental results show that the CPTM model outperforms other methods such as Rocchio, BM25 and SVM.

REFERENCES
[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in ACM SIGMOD, 1993, pp. 207-216.
[2] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[3] V. Devedzic, "Knowledge discovery and data mining in databases," in Handbook of Software Engineering and Knowledge Engineering, vol. 1, S. K. Chang, Ed., World Scientific Publishing Co., 2001, pp. 615-637.
[4] L. Edda and K. Jorg, "Text categorization with support vector machines: how to represent texts in input space?," Machine Learning, vol. 46, pp. 423-444, 2002.
[5] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus, "Knowledge discovery in databases: an overview," AI Magazine, vol. 13, pp. 57-70, 1992.
[6] K. Gouda and M. J. Zaki, "GenMax: an efficient algorithm for mining maximal frequent itemsets," Data Mining and Knowledge Discovery, vol. 11, pp. 223-242, 2005.

Figure 5. The result of CPTM comparing to other methods.

Figure 5 shows the interpolated 11-points in precisionrecalls of CPTM comparing to other methods. It indicates that the CPTM outperforms others both at the low and high recall values. Figure 6 also reveals the similar result that CPTM has better performance in all measures comparing to those of other approaches including data mining method and traditional probability method.




[7] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," in ICML, 1997, pp. 143-151.
[8] T. Joachims, "Transductive inference for text classification using support vector machines," in ICML, 1999, pp. 200-209.
[9] C. Kaur and R. R. Aggarwal, "Web mining tasks and types: a survey," IJRIM, vol. 2, pp. 547-558, 2012.
[10] P. Kingsbury and M. Palmer, "Propbank: the next level of Treebank," in Treebanks and Lexical Theories, 2003.
[11] Y. Li, X. Tao, A. Algarni, and S.-T. Wu, "Mining specific and general features in both positive and negative relevance feedback," in TREC, 2009.
[12] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu, "Experimentation as a way of life: Okapi at TREC," Information Processing and Management, vol. 36, pp. 95-108, 2000.
[13] T. Rose, M. Stevenson, and M. Whitehead, "The Reuters Corpus Volume 1 - from yesterday's news to today's language resources," in Int. Conf. on Language Resources and Evaluation, 2002, pp. 29-31.
[14] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, pp. 1-47, 2002.
[15] S. Shehata, F. Karray, and M. Kamel, "A concept-based model for enhancing text categorization," in KDD, 2007, pp. 629-637.
[16] S.-T. Wu, Y. Li, and Y. Xu, "An effective deploying algorithm for using pattern-taxonomy," in iiWAS05, 2005, pp. 1013-1022.
[17] S.-T. Wu, Y. Li, and Y. Xu, "Deploying approaches for pattern refinement in text mining," in ICDM, 2006, pp. 1157-1161.
[18] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, "Automatic pattern-taxonomy extraction for web mining," in IEEE/WIC/ACM International Conference on Web Intelligence, 2004, pp. 242-248.
[19] X. Yan, J. Han, and R. Afshar, "CloSpan: mining closed sequential patterns in large datasets," in SIAM Int. Conf. on Data Mining (SDM03), 2003, pp. 166-177.
[20] C.-C. Yu and Y.-L. Chen, "Mining sequential patterns from multidimensional sequence data," IEEE Transactions on Knowledge and Data Engineering, vol. 17, pp. 136-140, 2005.
[21] S. Zhang, X. Wu, J. Zhang, and C. Zhang, "A decremental algorithm for maintaining frequent itemsets in dynamic databases," in International Conference on Data Warehousing and Knowledge Discovery (DaWaK05), 2005, pp. 305-314.


A New Approach to Cluster Visualization Methods Based on Self-Organizing Maps


Marcin Zimniak
Department of Computer Science Chemnitz, University of Technology Chemnitz, Germany marcin.zimniak@cs.tu-chemnitz.de

Johannes Fliege
Department of Computer Science Chemnitz, University of Technology Chemnitz, Germany johannes.fliege@cs.tu-chemnitz.de

Wolfgang Benn
Department of Computer Science Chemnitz, University of Technology Chemnitz, Germany wolfgang.benn@cs.tu-chemnitz.de
Abstract The Self-Organizing Map (SOM) is one of the artificial neural networks that perform vector quantization and vector projection simultaneously. Due to this characteristic, a SOM can be visualized twice: through the output space, which means considering the vector projection perspective, and through the input data space, emphasizing the vector quantization process. This paper aims at the idea of presenting high-dimensional clusters that are disjoint objects as groups of pairwise disjoint simple geometrical objects like 3D-spheres for instance. We expand current cluster visualization methods to gain better overview and insight into the existing clusters. We analyze the classical SOM model, insisting on the topographic product as a measure of degree of topology preservation and treat that measure as a judge tool for admissible neural net dimension in dimension reduction process. To achieve better performance and more precise results we use the SOM batch algorithm with toroidal topology. Finally, a software solution of the approach for mobile devices like iPad is presented. Keywords-Selforganizing maps (SOM); topology preservation; clustering; data-visualisation; dimension reduction; data-mining

the basic model: instead of updating prototypes one by one, they are all moved simultaneously at the end of each run, as in a standard gradient descent. In order to reduce border effects in the neural network we use a toroidal topology. For more details concerning the degree of organization we refer the reader to [1]. Applying this approach, we work with a socalled well-organized neural grid. One of our main tasks concerning the application of Self-Organizing Maps is to implement a suitable mapping procedure that should result in a topology preserving projection of high-dimensional data onto a low dimensional lattice. In our project we consider only three admissible dimensions of output space, namely ! = 1,2,3 for a given neuronal grid . However, in general, the choice of the dimension for the neural net does not guarantee to produce a topology-preserving mapping. Thus, the interpretation of the resulting map may fail. Therefore, we introduce the very important concept of a topologically preserving mapping, which means that similar data vectors are mapped onto the same or neighbored locations in the lattice and vice versa. In this paper we propose a new concept of cluster visualization; we illustrate clusters as disjoint objects in pairs of simple geometrical objects like spheres in 3D centered at best matching units (BMUs) coordinates within a neural network of admissible dimension. Our paper is organized as follows: in section 2 we give a precise mathematical description of SOM including the topology preservation measure (topographic product) as a measure for an admissible dimension of the output space. In section 3 we present existing methods of cluster visualization followed by the extension of a graphical visualization method for providing a new solution. In section 4 we demonstrate a software realization approach for our new visualization concept. Finally, we outline our conclusion and emerging further work in section 5.

I. INTRODUCTION

Neural maps are biologically inspired data representations that combine aspects of vector quantization with the property of function continuity. Self-Organizing Maps (SOMs) have been successfully applied as a tool for visualization, for clustering of multidimensional datasets, for image compression, and for speech and face recognition. A SOM is basically a method of vector quantization, i.e. this technique is obligatory in a SOM. Regarding dimensionality reduction, a SOM models data in a nonlinear and discrete way by representing it in a deformed lattice. The mapping, however, is given explicitly and well defined only for the prototypes and in most cases only offline algorithms implement SOMs. For our purpose we concern the so-called batch version of the SOM which can easily be derived from


II. MATHEMATICAL BACKGROUND OF THE SOM

One of the powerful approaches for our cluster considerations within SOM is to use Self-Organizing Maps to implement a suitable mapping procedure, which should result in a topology-preserving projection of the high-dimensional data onto a low-dimensional lattice. In most applications a two- or three-dimensional SOM lattice is the common choice of lattice structure because of its easy visualization. However, in general, this choice does not guarantee a topology-preserving mapping. Thus, the interpretation of the resulting map may fail. Topology-preserving mapping means that similar data vectors are mapped onto the same or neighbored locations in the lattice and vice versa.

A. SOM Algorithm and Topology Preservation
Within the framework of dimensionality reduction, the SOM can be interpreted intuitively as a kind of nonlinear but discrete PCA. Formally, a Self-Organizing Map (SOM), as a special kind of artificial neural network, projects data from some (possibly high-dimensional) input space V ⊆ R^{d_V} onto positions in some output space (neural map) A, such that a continuous change of a parameter of the input data should lead to a continuous change of the position of a localized excitation in the neural map. This property of neighborhood preservation depends on an important feature of the SOM, its output space topology, which has to be predefined before the learning process is started. If the topology of A (i.e. its dimensionality and its edge length ratios) does not match that of the data shape, neighborhood violations will occur. This can be written in a formal way by defining the output space positions as r = (r_1, r_2, ..., r_{d_A}) ∈ A, 1 ≤ r_k ≤ n_k, with N = n_1 · n_2 · ... · n_{d_A}, where n_k, k = 1, ..., d_A, represents the dimension of A (i.e. the edge length of the lattice) in the k-th direction. In general, other arrangements are possible, e.g. the definition of a connectivity matrix. Nevertheless, we consider hypercubes in our project. We associate a weight vector or pointer w_r with each neuron r in A. The mapping Ψ_{V→A} is realized by the winner-takes-all (WTA) rule. It updates only one prototype (the BMU) at each presentation of a datum. WTA is the simplest rule and includes classical competitive learning as well as frequency-sensitive competitive learning:

Ψ_{V→A}: v ↦ s(v) = argmin_{r ∈ A} ||v − w_r||        (1)

a corresponding graph structure G_V in A: two neurons are connected in G_V if and only if their masked receptive fields are neighbored. The graph G_V is called the induced Delaunay graph. For further details we refer the reader to [2]. Due to the bijective relation between neurons and weight vectors, G_V also represents the Delaunay graph of the weights (Fig. 1).


Figure 1. The Delaunay triangulation and Voronoi diagram are dual to each other in the graph theoretical sense.

To achieve the map Ψ, the SOM adapts the pointer positions w_r during the presentation of a sequence of data points v selected from a data distribution P(v) as follows:

Δw_r = ε · h_{rs} · (v − w_r)        (4)

where 0 < ε ≤ 1 denotes the learning rate, and h_{rs} is the neighborhood function, usually chosen to be of Gaussian shape:

h_{rs} = exp( −||r − s||^2 / (2σ^2) )        (5)
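A minimal sketch, not the authors' implementation, of one online SOM step combining the winner selection of equation (1), the update rule (4) and the Gaussian neighborhood (5); the lattice A, the pointers W and all parameter values are assumptions for illustration.

import numpy as np

def som_step(v, W, A, eps=0.1, sigma=1.0):
    # (1) best matching unit: s = argmin_r ||v - w_r||
    s = np.argmin(np.linalg.norm(W - v, axis=1))
    # (5) Gaussian neighborhood h_rs based on lattice distances ||r - s||
    h = np.exp(-np.sum((A - A[s]) ** 2, axis=1) / (2.0 * sigma ** 2))
    # (4) delta w_r = eps * h_rs * (v - w_r)
    W += eps * h[:, None] * (v - W)
    return s

# toy example: a 3x3 lattice mapping 2-D data
A = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
W = np.random.rand(9, 2)
for v in np.random.rand(200, 2):
    som_step(v, W, A)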

We note that h_{rs} depends on the best matching neuron s defined in (1). Topology preservation in SOMs is defined as the preservation of the continuity of the mapping from the input space onto the output space. More precisely, it is equivalent to the continuity of M (in the mathematical topological sense) between the topological spaces with a properly chosen metric in both A and V. Thus, to indicate topographic violations we need metric and topological conditions: e.g. in Fig. 2 a) a perfect topographic map is indicated, whereas in Fig. 2 b) topography is violated. The pair of nearest neighbors w_1, w_3 is mapped onto the neurons 1 and 3, which are not nearest neighbors. The distance relation between both is inverted as well: d_V(w_1, w_2) > d_V(w_1, w_3) but d_A(1, 2) < d_A(1, 3). Thus, topological and metric conditions are violated. For detailed considerations we refer to [3]. The topology-preserving property can be used for immediate evaluations of the resulting map, e.g. for interpretation as a color space, which we applied in Sec. 3. As we already pointed out in the introduction, violations of the topographic mapping may raise false interpretations. Several approaches were developed to measure the degree of topology preservation for a given map. We chose the topographic product P, which relates the sequence of input space neighbors to the sequence of output space neighbors for each neuron. Instead of using the Euclidean distances between the

where the corresponding reverse mapping is defined as Ψ_{A→V}: r ↦ w_r. These two functions together determine the map

M = (Ψ_{V→A}, Ψ_{A→V})        (2)

realized by the SOM network. All data points v that are mapped onto the neuron r make up its receptive field Ω_r. The masked receptive field of neuron r is defined as the intersection of its receptive field with V, namely

Ω̂_r = {v ∈ V : r = Ψ_{V→A}(v)}        (3)

Therefore, the masked receptive fields Ω̂_r are closed sets. All masked receptive fields form the Voronoi tessellation (diagram) of V. If the intersection of two masked receptive fields Ω̂_r, Ω̂_{r'} is non-vanishing (Ω̂_r ∩ Ω̂_{r'} ≠ ∅), we call both of them neighbored. The neighborhood relations form


Figure 2. Metric vs. topological conditions for map topography (weight vectors w1-w4 in the input space and their neurons 1-4 in the output space).

Figure 3. Values of the topographic product P, plotted over the output space dimension d_A, for the speech data.

weight vectors, this measure applies the respective distances d_V^{G_V}(w_j, w_{j'}) of minimal path lengths in the induced Delaunay graph G_V of V. During the computation, the sequences n^A_m(r) of the m-th neighbors of r in A and n^V_m(r), describing the m-th neighbors of w_r in V, have to be determined for each node r. These sequences, and further averaging over neighborhood orders and nodes, finally lead to

P = 1/(N(N−1)) · Σ_r Σ_{m=1}^{N−1} 1/(2m) · log( Π_{l=1}^{m} [ d_V^{G_V}(w_r, w_{n_l^A(r)}) / d_V^{G_V}(w_r, w_{n_l^V(r)}) ] · [ d_A(r, n_l^A(r)) / d_A(r, n_l^V(r)) ] )        (6)
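The sketch below is a simplified version of the topographic product: it uses plain Euclidean input-space distances instead of the induced-Delaunay graph distances of equation (6), so it illustrates the structure of the measure rather than the authors' exact variant. W holds the weight vectors and R the lattice coordinates; both are assumed inputs.

import numpy as np

def topographic_product(W, R):
    # W: (N, d_V) weight vectors; R: (N, d_A) lattice coordinates
    N = len(W)
    dV = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)
    dA = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    P = 0.0
    for r in range(N):
        nA = np.argsort(dA[r]); nA = nA[nA != r]   # neighbors ordered in output space
        nV = np.argsort(dV[r]); nV = nV[nV != r]   # neighbors ordered in input space
        for m in range(1, N):
            q = (dV[r, nA[:m]] / dV[r, nV[:m]]) * (dA[r, nA[:m]] / dA[r, nV[:m]])
            P += np.sum(np.log(q)) / (2 * m)
    return P / (N * (N - 1))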

observation at a time. For most methods the choice of the model largely orients the implementation towards one or the other type of algorithm. Generally, the simpler the model, the more freedom is left to the implementation. In our project we apply the batch version of the SOM, described by the following algorithm:

1) Define the lattice A by assigning the low-dimensional coordinates of the prototypes in the embedding space.
2) Initialize the coordinates of the prototypes in the data space.
3) Assign to ε and to the neighborhood function h_{rs} their scheduled values for epoch q.
4) For all points in the data set, compute the best matching units as in (1) and update the prototypes according to (4).
5) Continue with step 3 until convergence is reached (i.e. updates of the prototypes become negligible).

III. DATA MINING WITH SOM

If a proper SOM is trained according to the above-mentioned criteria, several methods for representation and post-processing can be applied. In the case of a two-dimensional lattice of neurons, many visualization approaches are known. The most common method for the visualization of SOMs is to project the weight vectors onto the space spanned by the first principal components of the data and to connect those units that are neighbored in the lattice. However, if the shape of the SOM lattice is hypercubical, there are several more ways to visualize the properties of the map. For our purpose we focus only on those that are of interest in our application. An extensive overview can be found in [6].

A. Current Cluster Visualization Methods of SOM
An interesting evaluation is the so-called U-matrix introduced by [5] (Fig. 4). The elements of the matrix U represent the distances between the respective weight vectors that are neighbors in the neural network A. The matrix U can be used to determine clusters within the weight vector set and, hence, within the data space. Assuming that the map is topology preserving, large values indicate cluster boundaries. If the lattice is a two-dimensional array, the U-matrix can easily be viewed and provides a powerful tool for cluster analysis. Another visualization technique can be used if the lattice is three-dimensional. The data points can then be mapped onto a neuron r, which can be identified by the combination of the colors red, green and blue (Fig. 5) assigned to the location r. In such a way we are able to assign a color to each data point accord-


The sign of P approximately indicates the relation between the input and output space topology, whereas P < 0 corresponds to a too low-dimensional input space, P ≈ 0 indicates an approximate match, and P > 0 corresponds to a too high-dimensional input space. In the definition of P, topological and metric properties of a map are mixed. This mixture provides a simple mathematical characterization of what P actually measures. However, for the case of perfect preservation of an order relation, identical sequences n^A_m(r) and n^V_m(r) result in P taking on the value P = 0. The application of SOMs to very high-dimensional data can produce difficulties that may result from the so-called curse of dimensionality: the problem of sparse data caused by the high data dimensionality. We refer to the approach proposed by KASKI in [4].

B. Application of the Topographic Product involving Real-World Data
The data set in this case consists of speech feature vectors (d_V = 19, the dimension of the input space) obtained from several speakers uttering the German numerals1. We see (Fig. 3) that in this case the topographic product singles out d_A ≈ 3.

C. Batch Version of Kohonen's Self-Organizing Map
Depending on the application, data observations may arrive consecutively or, alternatively, the whole data set may be available at once. In the first case, an online algorithm is applied; in the second case, an offline algorithm suffices. More precisely, offline or batch algorithms cannot work until the whole set of observations is known. On the contrary, online algorithms typically work with no more than a single
1 The data is available at III. Physikalisches Institut Goettingen; previously investigated in [8], [9].


Figure 5. Cluster visualization via U-Matrix.

ing to equation (1), and similar colors will encode groups of input patterns that were mapped close to one another in the lattice A. It should be emphasized that for a proper interpretation of this color visualization, as well as for the analysis of the U-matrix, topology preservation of the map M is a strict requirement. Furthermore, we should pay regard to the fact that the topology-preserving property of M must be proven prior to any evaluation of the map.

B. A New Concept for Cluster Visualization
We provide a new idea for visualizing clusters as pairwise disjoint simple objects like 3D spheres, independently of the resulting admissible output space. In this manner, additionally to the existing visualization methods, we are able to distinguish and illustrate the volume of each cluster by the radius of the constructed spheres. In the following steps we describe our visualization approach in further detail. At the very beginning the input data set is predefined as a clustered data set after the GNG [11] learning process is finished. Afterwards the batch version of the SOM algorithm is performed, whereby the BMUs are computed for all input clusters respectively. Finally, the dimension reduction of the input space is achieved by utilizing the topographic product as a judgment tool for an admissible output space. Affine spaces provide a better framework for doing geometry. In particular, it is possible to deal with points, curves, surfaces, etc., in an intrinsic manner, i.e. independently of any specific choice of a coordinate system. Naturally, coordinate systems have to be chosen to finally carry out computations, but one should learn to resist the temptation to resort to coordinate systems until it becomes necessary. So, we treat the admissible output space as an affine space in an intrinsic manner, where no special origin is predefined. We set the origin at the neuron numbered 1 (Fig. 6). For simplicity, in the neuronal grid the distances between all directly neighboring neurons are set to 1. Let |C_i| denote the power of a cluster C_i (the number of entities in C_i). We aim to construct a presentation space in a homogeneous form, in the sense of the space dimension, for any case of d_A. We calculate the radius of the spheres2 centered at the corresponding BMUs as follows:

Figure 4. Representation of positions of neurons in the threedimensional neuron lattice A as a vector c=(r,g,b) in the color space C, where r, g, b denote the intensity of the colors red green and blue. Thus, colors are assigned to categories (winner neurons).

r_i = 0.5 · ( 1 − |C_i| / Σ_j |C_j| )        (7)
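A one-line sketch of equation (7) with hypothetical cluster sizes; since every radius stays below 0.5 and neighboring lattice nodes are one unit apart, spheres centered at different BMUs cannot intersect.

def sphere_radii(cluster_sizes):
    total = float(sum(cluster_sizes))
    return [0.5 * (1.0 - c / total) for c in cluster_sizes]

print(sphere_radii([300, 120, 80]))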

Obviously, spheres constructed in that manner in the output space of dimension d_A do not have any point in common. In our calculations we apply the parametric equation of a sphere. In order to keep the presentation space homogeneous of dimension 3 (Fig. 7), with no relative topology at present, we extend the output space as described below. In the case of d_A = 3 we perform no operation, since no extension is needed (identity map). In the case of d_A = 2 the mapping

(cos φ, sin φ) ↦ (cos φ, sin φ cos θ, sin φ sin θ)        (8)

where 0 ≤ φ < 2π, 0 ≤ θ ≤ π, needs to be applied. Finally, in the case of d_A = 1 (a composition of functions) the application of

cos φ ↦ (cos φ, sin φ) ↦ (cos φ, sin φ cos θ, sin φ sin θ)        (9)

where 0 ≤ φ < 2π, 0 ≤ θ ≤ π, becomes necessary. In our method we propose to describe clusters as disjoint spheres whose centers are located at the respective BMU positions after the batch SOM algorithm has finished. In all cases of the topology preservation criterion results (an admissible neuronal net dimension of 1, 2 or 3 after the dimension reduction process) we are able to construct a group of disjoint spheres in 3D.

C. Comments
The novelty of our approach is to present clusters via suitably separated sphere objects in a homogeneous 3D presentation space. In contrast to the k-clustering concept [12], we apply the modern Growing Neural Gas unsupervised learning process, which returns separated objects in the form of a clustered probability distribution for a given input data set of possibly high dimension. Finally, we link this concept with the Self-Organizing Map framework in order to illustrate clusters in a space of admissible reduced dimension. For a compre-

2 In our considerations we use the term spheres for all cases of d_A regarding the topology amongst them.

Figure 7. Expansion of output space A to presentation space depending on admissible output space dimension dA.


quirements analyses led us to centralize computational effort, thus, utilizing the application on a mobile device only as interface for visualization and user interaction. C. Realization We separated our application into two parts: a server application, and a client application for mobile devices. As described in our requirements analysis we decided to centralize computation effort on the server side, thus, realizing SOM computations there. For realizing the SOM computations we made use of SOM Toolbox contained in Matlab by building a bridge to C++ for enabling our server application to run the necessary SOM transformations easily. Using this tool chain allowed us to prepare the cluster data for visualization by dimension reduction through SOM efficiently. The mobile application was designed to run on mobile platforms with touch interfaces but comparably low computational resources. An example screen shot of our user interface is given in Fig. 8 showing clusters, i.e. spheres, that were transformed from n-dimensional space to 3-dimensional output space using SOM. As shown in Fig. 8, the spheres are of different size. We decided to use a spheres size to implicitly visualize the number of data tuples contained in its according cluster. For determining a spheres actual size we put the number of data tuples in a cluster into relation to the number of data tuples contained in all clusters. To prevent the spheres from intersecting each other we decided to limit their size by regarding the minimum Euclidian distance min of each pair of spheres amongst all spheres into consideration. At a first glance we took the radius of a sphere into consideration for determining its size by making the radius proportionally dependent of the number of data tuples contained in the underlying cluster. Nevertheless, data is contained in a cluster, which leads us to the volume of spheres. Therefore, we decided to represent the number of tuples in a cluster by making a spheres volume dependent on these. Thus, we were able to implicitly represent the data amount contained in a cluster. Our example was based on a data set with 998 dimensions in input space. D. Capabilities of our Example The software system presented in our example is capable of visualizing information on the clustering state of a semantic based database index allowing the user to navigate through the index cluster structure. This may be performed either by using the visualization feature of the index hierarchy or by utilizing the realized SOM-based visualization feature. In future development our aim is to present more


Figure 6. Neurons and best matching units in a chosen admissible output space with the origin neuron intrinsically numbered with 1.

hensive source on dimension reduction of high-dimensional data the reader is referred to [13].

IV. VISUALIZING CLUSTER INFORMATION VIA SOM ON MOBILE DEVICES

The following example will describe a realization of a SOM-based cluster visualization technique for information visualization, thus, displaying a semantic-based database index cluster structure on mobile platforms. The aim was to visually represent the internal database index organization structure intuitively to a user. Our realization had to focus on different requirements. A. Requirements The implementation of a SOM-based cluster visualization platform to display a database index cluster data on mobile entities had to fulfill certain requirements. First of all, the requirement to run our application on mobile devices with potentially low computational power was a challenge. Second, the functionality of our application had to be ensured using any type of network connection provided by the mobile device also including mobile networks with low bandwidth. As a functional requirement, it was requested to visualize clusters as spheres, where the number of data tuples contained in each cluster should be presented implicitly. B. Requirements Analysis Due to computational limitations of mobile platforms, the possibility of running SOM transformations on a mobile device could not be regarded as feasible. Thus, a separation of our desired application into a client and a server part was regarded as the most promising solution. Based on the result of the analysis of our first requirement, we did not regard it as suitable to transmit all cluster data required for SOM computations. We decided to transmit only the results of the SOM process since this also seemed to guarantee a smaller data amount compared to the SOMs input data. Furthermore, we intended to reduce possible error causes with this decision regarding the possible necessity of different implementations for different mobile platforms. Finally, the re-


REFERENCES
[1] G. Andreu, A. Crespo, and J. M. Valiente, "Selecting the toroidal self-organizing feature maps (TSOFM) best organized to object recognition," in Proceedings of the International Conference on Artificial Neural Networks, Houston (USA), volume 1327 of Lecture Notes in Computer Science, pages 1341-1346, June 1997.
[2] T. Martinetz and K. Schulten, "Topology representing networks," Neural Networks, vol. 7, no. 3, pp. 507-522, 1994.
[3] T. Villmann, R. Der, M. Herrmann, and T. Martinetz, "Topology preservation in self-organizing feature maps: exact definition and measurement," IEEE Transactions on Neural Networks, 8(2):256-266, 1997.
[4] S. Kaski, J. Nikkilä, and T. Kohonen, "Methods for interpreting a self-organized map in data analysis," in Proc. of the European Symposium on Artificial Neural Networks (ESANN98), pages 185-190, Brussels, Belgium, 1998. D-facto publications.
[5] A. Ultsch, "Self organized feature maps for monitoring and knowledge acquisition of a chemical process," in S. Gielen and B. Kappen, editors, Proc. ICANN93, Int. Conf. on Artificial Neural Networks, pages 864-867, London, UK, 1993. Springer.
[6] J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, 3(7):123-456, 1999.
[7] T. Kohonen, Self-Organizing Maps, Springer, Berlin, Heidelberg, 1995 (second extended edition 1997).
[8] H.U. Bauer and K. Pawelzik, "Quantifying the neighborhood preservation of self-organizing feature maps," IEEE Transactions on Neural Networks, 3(4):570-579, 1992.
[9] T. Gramss and H.W. Strube, "Recognition of isolated words based on psychoacoustics and neurobiology," Speech Communication, 9:35-40, 1990.
[10] T. Kohonen, Self-Organization and Associative Memory, 2nd edition, Springer-Verlag, Berlin, Germany, 1988.
[11] B. Fritzke, "A growing neural gas network learns topologies," in G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 625-632, MIT Press, Cambridge, MA, 1995.
[12] F. P. Preparata and M. I. Shamos, Computational Geometry: An Introduction, Springer-Verlag, 1985.
[13] John A. Lee and Michel Verleysen, Nonlinear Dimensionality Reduction, Springer, 2007.

Figure 8. Visualization of clusters in three-dimensional output space after applying SOM.

detailed information and to increase user-interaction possibilities, potentially influencing the clustering process.

V. CONCLUSION AND FURTHER WORK

In our paper we have described the SOM in depth from the mathematical point of view, giving a precise description of this kind of neural net and emphasizing the role of the topographic product as a criterion for admissible neural net dimensions in the dimension reduction process. We have proposed a new illustration method for cluster visualization, linking existing color-based (RGB) visualization methods with methods based on separated objects like 3D spheres, providing a better understanding of clusters as disjoint objects. Finally, the software realization approach has been presented. In our further research we will consider a data-driven version of the SOM, the so-called growing SOM (GSOM). Its output is a structure-adapted hypercube A, produced by adapting both the dimensions and the respective edge length ratios of A during learning, in addition to the usual adaptation of the weights. In comparison to the standard SOM, the overall dimensionality and the dimensions along the individual directions in A are variables that evolve into the hypercube structure most suitable for the input space topology.


Detecting Source Topics using Extended HITS


Mario Kubek
Faculty of Mathematics and Computer Science FernUniversitt in Hagen Hagen, Germany Email: kn.wissenschaftler@fernuni-hagen.de Faculty of Mathematics and Computer Science FernUniversitt in Hagen Hagen, Germany Email: kn.wissenschaftler@fernuni-hagen.de occurrence graphs are undirected which is suitable for the flat visualisation of term relations and for applications like query expansion via spreading activation techniques. However, real-life associations are mostly directed, e.g. an Audi a German car but not every German car is an Audi. The association of Audi with German car is therefore much stronger than the association of German car with Audi. Therefore, it actually makes sense to deal with directed term relations. The HITS algorithm [4], which was initially designed to evaluate the relative importance of nodes in web graphs (which are directed), returns two list of nodes: authorities and hubs. Authorities that are also determined by the PageRank algorithm [5], are nodes that are often linked to by many other nodes. Hubs are nodes that link to many other nodes. Nodes are assigned both a score for their authority and their hub value. For undirected graphs the authority and the hub score of a node would be the same, which is naturally not the case for the web graph. Referred to the analysis of directed co-occurrence graphs with HITS, the authorities are the characteristic terms of the analysed text, whereas the hubs represent its source topics. Therefore, it is necessary to describe the construction of directed co-occurrence graphs before getting into the details of the method to determine the source topics and its applications. Hence, the paper is organised as follows: the next section explains the methodology used. In this section it is outlined, how to calculate directed term relations from texts by using co-occurrence analysis in order to obtain directed co-occurrence graphs. Afterwards, section three presents a method that applies an extended version of the HITS algorithm that considers the strength of these directed term relations to calculate the characteristic terms and source topics in texts. Section four focuses on the conducted experiments using this method. It is also shown that the results of this method can be used to find similar and related documents in the World Wide Web. Section five concludes the paper and provides a look at options to employ this method in solutions to follow topics in large corpora like the World Wide Web. II. METHODOLOGY Well known measures to gain co-occurrence significance values on sentence level are for instance

Herwig Unger
Faculty of Mathematics and Computer Science
FernUniversität in Hagen
Hagen, Germany
Email: kn.wissenschaftler@fernuni-hagen.de

Abstract: This paper describes a new method to determine the sources of topics in texts by analysing their directed co-occurrence graphs using an extended version of the HITS algorithm. This method can also be used to identify characteristic terms in texts. In order to obtain the needed directed term relations that cover asymmetric real-life relationships between concepts, it is described how they can be calculated by statistical means. In the experiments, it is shown that the detected source topics and characteristic terms can be used to find similar documents and those that mainly deal with the source topics in large corpora like the World Wide Web. This approach also offers a new way to follow topics across multiple documents in such corpora. This application will be elaborated on as well.

Keywords: Source topic detection; Co-occurrence analysis; Extended HITS; Text Mining; Web Information Retrieval

I. INTRODUCTION AND MOTIVATION

The selection of characteristic and discriminating terms in texts through weights, often referred to as keyword extraction or terminology extraction, plays an important role in text mining and information retrieval. In [1] it has been pointed out that graph-based methods are well suited for the analysis of co-occurrence graphs, e.g. for the purpose of keyword extraction, and deliver results comparable to classic approaches like TF-IDF [2] and difference analysis [3]. Especially the proposed extended version of the PageRank algorithm, which takes into account the strength of the semantic term relations in these graphs, is able to return such characteristic terms and does not rely on reference corpora. In this paper, the authors extend this approach by introducing a method to not only determine these keywords, but also to determine terms in texts that can be referred to as source topics. These terms strongly influence the main topics in texts, yet are not necessarily important keywords themselves. They are helpful when it comes to applications like following topics to their roots by analysing documents that cover them primarily. This process can span several documents. In order to automatically determine source topics of single texts, the authors present the idea to apply an extended version of the HITS algorithm [4] on directed co-occurrence graphs for this purpose. This solution will not only return the most characteristic terms of texts like the extended PageRank algorithm, but also the source topics in them. Usually, co-occurrence graphs are undirected, which is suitable for the flat visualisation of term relations and for applications like query expansion via spreading activation techniques. However, real-life associations are mostly directed; e.g. an Audi is a German car, but not every German car is an Audi. The association of Audi with German car is therefore much stronger than the association of German car with Audi. Therefore, it actually makes sense to deal with directed term relations. The HITS algorithm [4], which was initially designed to evaluate the relative importance of nodes in web graphs (which are directed), returns two lists of nodes: authorities and hubs. Authorities, which are also determined by the PageRank algorithm [5], are nodes that are often linked to by many other nodes. Hubs are nodes that link to many other nodes. Nodes are assigned both a score for their authority and for their hub value. For undirected graphs the authority and the hub score of a node would be the same, which is naturally not the case for the web graph. Applied to the analysis of directed co-occurrence graphs with HITS, the authorities are the characteristic terms of the analysed text, whereas the hubs represent its source topics. Therefore, it is necessary to describe the construction of directed co-occurrence graphs before getting into the details of the method to determine the source topics and its applications. Hence, the paper is organised as follows: the next section explains the methodology used. In this section it is outlined how to calculate directed term relations from texts by using co-occurrence analysis in order to obtain directed co-occurrence graphs. Afterwards, section three presents a method that applies an extended version of the HITS algorithm that considers the strength of these directed term relations to calculate the characteristic terms and source topics in texts. Section four focuses on the conducted experiments using this method. It is also shown that the results of this method can be used to find similar and related documents in the World Wide Web. Section five concludes the paper and provides a look at options to employ this method in solutions to follow topics in large corpora like the World Wide Web.

II. METHODOLOGY

Well known measures to gain co-occurrence significance values on sentence level are for instance


the mutual information measure [6], the Dice [7] and Jaccard [8] coefficients, the Poisson collocation measure [9] and the log-likelihood ratio [10]. While these measures return the same value for the relation of a term A with another term B and vice versa, an undirected relation of both terms often does not represent real-life relationships very well, as has been pointed out in the introduction. Therefore, it is sensible to deal with directed relations of terms. To measure the directed relation of term A with term B, which can also be regarded as the strength of the association of term A with term B, the following formula of the conditional relative frequency can be used, whereby n_AB is the number of times terms A and B co-occurred in the text on sentence level and n_A is the number of sentences term A occurred in:

    sig(A -> B) = n_AB / n_A     (1)

Often, this significance differs greatly with regard to the two directions of the relation when the difference of the involved term frequencies is high. The association of a less frequently occurring term A with a frequently occurring term B could reach a value of 1.0 when A always co-occurs with B; however, B's association with A could be almost 0. This means that B's occurrence with term A is insignificant in the analysed text. That is why it is sensible to only take into account the direction of the dominant association (the one with the higher value) to generate a directed co-occurrence graph for the further considerations. However, the dominant association should be additionally weighted. In the example above, term A's association with B is 1.0. If another term C, which appears more frequently in the text than A, also co-occurs with term B each time it appears, then its association value with B would be 1.0, too. Yet, this co-occurrence is more significant than the co-occurrence of A with B. An additional weight that influences the association value and considers this fact could be determined by the (normalised) number of sentences in which both terms co-occur or the (normalised) frequency of the term A. The normalisation basis could be the maximum number of sentences which any term of the text has occurred in.

The association Assn of term A with term B can then be calculated using the second approach by:

    Assn(A -> B) = (n_A / n_max) * sig(A -> B)     (2)

Hereby, n_max is the maximum number of sentences any term has occurred in. A thus obtained relation of term A with term B with a high association strength can be interpreted as a recommendation of A for B. Relations gained by this means are more specific than undirected relations between terms because of their direction. They resemble a hyperlink on a website to another one. In this case, however, it has not been manually and explicitly set, and it carries an additional weight that indicates the strength of the term association. The set of all such relations obtained from a text represents a directed co-occurrence graph. The next step is now to analyse such graphs with an extended version of the HITS algorithm that regards these association strengths in order to find the source topics in texts. Therefore, in the next section the extension of the HITS algorithm is explained and a method that employs it for the analysis of directed co-occurrence graphs is outlined.
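As a minimal illustration of the statistics above, the following Python sketch derives directed association strengths from a text that is assumed to be already split into sentences of stopword-filtered, stemmed terms. The function name, the tie-breaking for equally strong directions and the dictionary-based graph representation are illustrative choices, not part of the original method.

```python
from collections import defaultdict
from itertools import combinations

def directed_cooccurrence_graph(sentences):
    """Build a directed, weighted co-occurrence graph on sentence level.

    sentences: iterable of term lists (stopwords removed, stemmed).
    Returns a dict mapping (source_term, target_term) -> association strength,
    keeping only the dominant direction of every co-occurring term pair.
    """
    n_term = defaultdict(int)   # number of sentences a term occurs in
    n_pair = defaultdict(int)   # number of sentences an (unordered) pair co-occurs in
    for sent in sentences:
        terms = set(sent)
        for t in terms:
            n_term[t] += 1
        for a, b in combinations(sorted(terms), 2):
            n_pair[(a, b)] += 1

    n_max = max(n_term.values())        # normalisation basis, formula (2)
    graph = {}
    for (a, b), n_ab in n_pair.items():
        sig_ab = n_ab / n_term[a]       # formula (1): association of a with b
        sig_ba = n_ab / n_term[b]       # formula (1): association of b with a
        src, dst, sig = (a, b, sig_ab) if sig_ab >= sig_ba else (b, a, sig_ba)
        graph[(src, dst)] = (n_term[src] / n_max) * sig   # formula (2)
    return graph

# usage: graph = directed_cooccurrence_graph([["audi", "german", "car"], ...])
```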

III. THE ALGORITHM

With the knowledge of how to generate directed co-occurrence graphs, it is now possible to introduce a new method to analyse them in order to find source topics in the texts they represent. For this purpose the application of the HITS algorithm on these graphs is sensible due to its working method that has been outlined in the introduction. The list of hub nodes in these graphs returned by HITS contains the terms that can be regarded as the source topics of the analysed texts, as they represent their inherent concepts. Their hub value indicates their influence on the most important topics and terms that can be found in the list of authorities. For the calculation of these lists using HITS, it is also sensible to include the strength of the associations between the terms. These values should also influence the calculation of the authority and hub values. The idea behind this approach is that a random walker is likely to follow links in co-occurrence graphs that lead to terms that can be easily associated with the current term he is visiting. Nodes that contain terms that are linked with a low association value, however, should not be visited very often. This also means that nodes that lie on paths with links of high association values should be ranked highly, as they can be reached easily. Therefore, the formulas for the update rules of the HITS algorithm can be modified to include the association values Assn. The authority value of node x can then be determined using the following formula:

    a(x) = Σ_{(y,x) in E} Assn(y -> x) * h(y)     (3)

Accordingly, the hub value of node x can be calculated using the following formula:

    h(x) = Σ_{(x,y) in E} Assn(x -> y) * a(y)     (4)

whereby E denotes the edge set of the directed co-occurrence graph. The following steps are necessary to obtain a list for the authorities and hubs based on these update rules:
1. Remove stopwords and apply a stemming algorithm on all terms in the text (optional).
2. Determine the dominant association for all co-occurrences using formula (1), apply the additional weight on it according to formula (2) and use the set of all these relations as a directed co-occurrence graph G.
3. Determine the authority value a(x) and the hub value h(x) iteratively for all nodes x in G using formulas (3) and (4) until convergence is reached (the calculated values do not change significantly in two consecutive iterations) or a fixed number of iterations has been executed.
4. Order all nodes descendingly according to their authority and hub values and return these two ordered lists with the terms and their authority and hub values.
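The following Python sketch illustrates steps 3 and 4 on the graph produced above, i.e. the weighted update rules (3) and (4). The normalisation to unit length and the convergence threshold are assumptions made for the sketch, not values prescribed by the paper.

```python
def extended_hits(graph, iterations=50, eps=1e-6):
    """Weighted HITS on a directed co-occurrence graph.

    graph: dict mapping (source, target) -> association strength Assn.
    Returns (authorities, hubs) as dicts of normalised scores.
    """
    nodes = {n for edge in graph for n in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority update, formula (3): weighted sum over incoming hub scores
        new_auth = {n: 0.0 for n in nodes}
        for (src, dst), w in graph.items():
            new_auth[dst] += w * hub[src]
        # hub update, formula (4): weighted sum over outgoing authority scores
        new_hub = {n: 0.0 for n in nodes}
        for (src, dst), w in graph.items():
            new_hub[src] += w * new_auth[dst]
        # normalise both score vectors to unit length
        a_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        new_auth = {n: v / a_norm for n, v in new_auth.items()}
        new_hub = {n: v / h_norm for n, v in new_hub.items()}
        converged = (max(abs(new_auth[n] - auth[n]) for n in nodes) < eps and
                     max(abs(new_hub[n] - hub[n]) for n in nodes) < eps)
        auth, hub = new_auth, new_hub
        if converged:
            break
    return auth, hub

# authorities (characteristic terms) and hubs (source topics), ordered descendingly:
# sorted(auth.items(), key=lambda kv: kv[1], reverse=True)[:10]
```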

Now, the effectiveness of this method will be illustrated by experiments.

IV. EXPERIMENTS

A. Detection of Authorities and Hubs
The following tables show, for two documents of the English Wikipedia, the lists of the 10 terms with the highest authority and hub values. To conduct these experiments the following parameters have been used: removal of stopwords, restriction to nouns, baseform reduction activated, and phrase detection.

TABLE I. TERMS AND PHRASES WITH HIGH AUTHORITY AND HUB VALUES OF THE WIKIPEDIA ARTICLE "LOVE"

  Term         Authority value   Term/Phrase          Hub value
  love         0.54              friendship           0.19
  human        0.30              intimacy             0.17
  god          0.29              passion              0.14
  attachment   0.26              religion             0.14
  word         0.21              attraction           0.14
  form         0.21              platonic love        0.13
  life         0.20              interpersonal love   0.13
  feel         0.18              heart                0.13
  people       0.17              family               0.13
  buddhism     0.14              relationship         0.12

TABLE II. TERMS AND PHRASES WITH HIGH AUTHORITY AND HUB VALUES OF THE WIKIPEDIA ARTICLE "EARTHQUAKE"

  Term         Authority value   Term/Phrase          Hub value
  earthquake   0.48              movement             0.18
  earth        0.30              plate boundary       0.16
  fault        0.27              damage               0.15
  area         0.23              zone                 0.15
  boundary     0.18              landslide            0.15
  plate        0.16              seismic activity     0.14
  structure    0.16              wave                 0.14
  rupture      0.15              ground               0.13
  aftershock   0.15              rupture              0.13
  tsunami      0.14              propagation          0.12

The examples show that the extended HITS algorithm can determine the most characteristic terms (authorities) and source topics (hubs) in texts by analysing their directed co-occurrence graphs. Especially the hubs provide useful information to find suitable terms that can be used as search words in queries when background information is needed to a specific topic. However, also the terms found in the authority lists can be used as search words in order to find similar documents. This will be shown in the next subsection. B. Search Word Extraction The suitability for these terms as search words will now be shown. For this purpose, the five most important authorities and the five most important hubs of the Wikipedia article "Love" have been combined as search queries and sent to Google. These results have been obtained using the determined authorities:


Figure 1: Search results for the authorities of the Wikipedia article "Love"

The search query containing the hubs of this article will lead to these results:

Figure 2: Search results for the hubs of the Wikipedia article "Love"

These search results clearly show that they primarily deal with either the authorities or the hubs. More experiments confirm this correlation. Using the authorities as queries to Google, it is possible to find documents in the Web that are similar to the analysed one. Usually, the analysed document itself is found among the first search results, which is not surprising. However, it shows that this approach could be a new way to detect plagiarised documents. It is also interesting to point out the topic drift in the results when the hubs have been used as queries. This observation indicates that the hubs of documents can be used as a means to follow topics across several related documents with the help of Google. This possibility will be elaborated on in more detail in the next and final section of this paper.

V. CONCLUSION

In this paper, a new graph-based method to determine source topics in texts based on an extended version of the HITS algorithm has been introduced and described in detail. Its effectiveness has been shown in the experiments. Furthermore, it has been demonstrated that the characteristic terms and the source topics that this method finds in texts can be used as search words to find similar and related documents in the World Wide Web. Especially the determined source topics can lead users to documents that primarily deal with these important aspects of their originally analysed texts. This goes beyond a simple search for similar documents as it offers a new way to search for related documents, while it is still possible to find similar documents when the source topics are used in queries. This functionality can be seen as a useful addition to Google Scholar (http://scholar.google.com/), which offers users the possibility to search for similar scientific articles. Additionally, interactive search systems can employ this method to provide their users with functions to follow topics across multiple documents. The iterative use of source topics as search words in found documents can provide a basis for a fine-grained analysis of topical relations that exist between the search results of two consecutive queries. Documents found in later iterations of such search sessions can give users valuable background information on the content and topics of their originally analysed documents. Another interesting application of this method can be seen in the automatic linking of related documents in large corpora. If a document A primarily deals with the source topics of another document B, then a link from A to B can be set. This way, the herein described approach to obtain directed term associations is modified to gain the same effect on document level, namely to calculate recommendations for specific documents. These automatically determined links can be very useful in terms of positively influencing the ranking of search results, because these links represent semantic relations between documents that have been verified, in contrast to manually set links, e.g. on websites, which additionally can be automatically evaluated regarding their validity by using this approach. Also, these


automatically determined links provide a basis to rearrange returned search results based on the semantic relations between them. These approaches will be examined in later publications in detail. REFERENCES
[1] M. Kubek and H. Unger: Search Word Extraction Using Extended PageRank Calculations, in Herwig Unger, Kyandoghere Kyamakya, and Janusz Kacprzyk, editors, Autonomous Systems: Developments and Trends, volume 391 of Studies in Computational Intelligence, pages 325-337, Springer Berlin / Heidelberg, 2012
[2] G. Salton, A. Wong and C. S. Yang: A vector space model for automatic indexing, Commun. ACM, 18:613-620, November 1975
[3] G. Heyer, U. Quasthoff and Th. Wittig: Text Mining: Wissensrohstoff Text, W3L Verlag Bochum, 2006
[4] J. M. Kleinberg: Authoritative Sources in a Hyperlinked Environment, in Proc. of ACM-SIAM Symp. on Discrete Algorithms, San Francisco, California, pages 668-677, January 1998
[5] L. Page, S. Brin, R. Motwani and T. Winograd: The PageRank citation ranking: Bringing order to the web, Technical report, Stanford Digital Library Technologies Project, 1998
[6] M. Buechler: Flexibles Berechnen von Kookkurrenzen auf strukturierten und unstrukturierten Daten, Master's thesis, University of Leipzig, 2006
[7] L. R. Dice: Measures of the Amount of Ecologic Association Between Species, Ecology, 26(3):297-302, July 1945
[8] P. Jaccard: Étude Comparative de la Distribution Florale dans une Portion des Alpes et des Jura, Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547-579, 1901
[9] U. Quasthoff and Chr. Wolff: The Poisson Collocation Measure and its Applications, in Proc. Second International Workshop on Computational Approaches to Collocations, Wien, 2002
[10] T. Dunning: Accurate methods for the statistics of surprise and coincidence, Computational Linguistics, 19(1):61-74, 1994


Blended value based e-business modeling approach: A sustainable approach using QFD
Mohammed Naim A. Dewan
Curtin Graduate School of Business
Curtin University
Perth, Australia
mohammed.dewan@postgrad.curtin.edu.au

Mohammed A. Quaddus
Curtin Graduate School of Business
Curtin University
Perth, Australia
mohammed.quaddus@gsb.curtin.edu.au

Abstract: E-business and sustainability are the two current major global trends. But surprisingly none of the e-business modeling ideas covers the sustainability aspects of the business. Recently researchers have been introducing the green IS/IT/ICT concept, but none of them clearly explains how those concepts will be accommodated inside the e-business models. This research approach, therefore, aims to develop an e-business model in conjunction with sustainability aspects. The model explores and determines the optimal design requirements in developing an e-business model. This research approach also investigates how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model. This modeling approach is unique in the sense that in developing the model the sustainability concept is integrated with customers' value requirements, the business's value requirements, and the process's value requirements instead of only customers' requirements. QFD, AHP, and the Delphi method are used for the analysis of the data. Besides developing the blended value based e-business model, this research approach also develops a framework for modeling e-business in conjunction with blended value and sustainability which can be implemented by almost any other business in consideration of the business context.

Keywords: E-business, Business model, Sustainability, Blended value, QFD, AHP.

I. INTRODUCTION

Business modeling is not new and has had significant impacts on the way businesses are planned and operated today. Whilst business models exist for several narrow areas, broad comprehensive e-business models are still very informal and generic. The majority of the business modeling ideas consider only economic value aspects of the business and do not focus on social or environmental aspects. It is surprising that although e-business and sustainability are the two current major global trends, none of the e-business modeling ideas covers the sustainability aspects of the business. Researchers are now introducing the green IS/IT/ICT concept, but none of them clearly explains how those concepts will be accommodated inside the e-business models. Therefore, this research approach aims to develop an e-business model in conjunction with sustainability aspects. The model will be based on blended value and will explore and determine the optimal design requirements in developing an e-business model. This research approach will also investigate how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model. This modeling approach is distinct in the sense that in developing the model the sustainability concept is integrated with customers' value requirements, the business's value requirements, and the process's value requirements instead of only customers' requirements. For the analysis of the data, Quality Function Deployment (QFD), the Analytic Hierarchy Process (AHP), and the Delphi method are used. Besides developing the blended value based e-business model, this research approach will also develop a framework for modeling e-business in conjunction with blended value and sustainability which can be implemented by almost any other business in consideration of the business context. The following section clarifies the purpose of the approach. Definitions of terms used in the approach are given in Section 3. An extensive literature review is covered in Section 4. Sections 5 and 6 explicate the research methodology and the research process respectively. Research analysis is explained in Section 7, and finally, Section 8 concludes the article with a discussion.

II. PURPOSE OF THE APPROACH

The majority of research into business models in the IS field has been concerned with e-business and e-commerce [1]. There exist a number of ideas about e-business models, but most of them provide only a conceptual overview and concentrate only on the economic aspects of the business. None of the e-business modeling ideas exclusively considers the sustainability aspects. Similarly, there is a growing body of literature available about the sustainability of businesses which does not focus on e-business. But the intersection of these two global trends, e-business and sustainability, needs to be addressed. Although recently a very few researchers have talked about the green IT/IS/ICT concept, none of them clearly explains how that concept will fit into an e-business model to make it sustainable and, at the same time, protect the interests of the customers. This research approach will develop an e-business model based on blended value which will be sustainable and will safeguard the interests of the customers. The blended value requirements will identify and select the optimal design requirements necessary to be implemented for the sustainability of the businesses. Therefore, the main research questions of the approach are as follows:

Q1. What are the optimal/appropriate design requirements in developing an e-business model?


Q2. How can the sustainability dimensions be integrated with the value dimensions in developing an e-business model?

Based on the above research questions, this research approach consists of the following objectives:
- To explore and determine the optimal design requirements of an e-business model.
- To investigate how the concept of blended value dimensions can be used in developing an e-business model.
- To investigate how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model.
- To develop a value-sustainability framework for modeling e-business in conjunction with blended value and sustainability concepts.

III. DEFINITION OF TERMS

A. Blended Value
Blended value is the integration of economic value, social value, and environmental value for customers, businesses, and value processes. It is different from CSR value in the sense that CSR value is separate from profit maximization and its agenda is determined by external reporting, whereas blended value is integral to profit maximization and its agenda is company-specific and internally generated.

B. Value Requirements
Value requirements are the demands for value by customers (for satisfaction), businesses (for profit), and business processes (for an efficient value process). The value can be economic and/or social and/or environmental, demanded by customers and/or businesses and/or business processes to fulfill the customers' requirements and/or to achieve strategic goals and/or to ensure efficient value processes.

C. Design Requirements
Design requirements, also known as HOWs, are the requirements needed to fulfill the blended value requirements in the QFD process. After the needs are revealed, the company's technicians or product development team develop a set of design requirements in measurable and operable technical terms [2] to fulfill the value requirements.

IV. LITERATURE REVIEW

A. Business model and e-business model
Scholars have referred to the business model as a statement, a description, a representation, an architecture, a conceptual tool or model, a structural template, a method, a framework, a pattern, and as a set, as found by Zott et al. [3]. A study by Zott et al. [3] found that of a total of 49 conceptual studies in which the business model is clearly defined, almost one fourth are related to e-business. The majority of research into business models in the IS field has been concerned with e-business and e-commerce, and there have been some attempts to develop convenient classification schemas [1]. For example, definitions, components, and classifications of e-business models have been suggested [4, 5]. Timmers [6] was the first who defined the e-business model in terms of its elements and their interrelationships. Applegate [7] introduces six e-business models: focused distributors, portals, producers, infrastructure distributors, infrastructure portals, and infrastructure producers. Weill and Vitale [8] suggest a subdivision into so-called atomic e-business models, which are analyzed according to a number of basic components. There exist a few more e-business modeling approaches, such as Rappa [9], Dubosson-Torbay et al. [10], Tapscott, Ticoll and Lowy [11], Gordijn and Akkermans [12], and more. But the sustainability concept is still entirely absent in all of the e-business modeling ideas.

B. Sustainability of Business
Sustainable business means a business with a dynamic balance among three mutually interdependent elements: (i) protection of ecosystems and natural resources; (ii) economic efficiency; and (iii) consideration of social wellbeing such as jobs, housing, education, medical care and cultural opportunities [13]. Even though many scholars have focused their studies on sustainability incorporating economic, social, and environmental perspectives, most companies remain stuck in a social responsibility mind-set in which societal issues are at the periphery, not the core. The solution lies in the principle of shared (blended) value, which involves creating economic value in a way that also creates value for society by addressing its needs and challenges [14]. Moreover, most of the scholars mainly express the need for blended value, and very few of them provide more than hypothetical ideas for maintaining sustainability. A complete business model for sustainability with operational directions is still lacking.

C. E-business and Sustainability
E-business is the point where economic value creation and information technology/ICT come together [15]. ICT can have both positive and negative impacts on the society and the environment. Positive impacts can come from dematerialization and online delivery, transport and travel substitution, a host of monitoring and management applications, greater energy efficiency in production and use, and product stewardship and recycling; negative impacts can come from energy consumption and the materials used in the production and distribution of ICT equipment, energy consumption in use directly and for cooling, short product life cycles and e-waste, and exploitative applications [16]. Technology is a source of environmental contamination during product manufacture, operation, and disposal [17-19]. Corporations have the knowledge, resources, and power to bring about enormous positive changes in the earth's ecosystems [20]. Consistent with the definition of environmental sustainability of IT [21], sustainability of e-business can be defined as the activities within the e-business domain to minimize the negative impacts and maximize the positive impacts on the society and the environment through the design, production, application, operation, and disposal of information technology and information technology-enabled products and services throughout their life cycle.


V. RESEARCH METHODOLOGY In this approach, initially a sustainable e-business modeling approach based on blended value is proposed after considering the previous literature and the research objectives. This proposed model can be tested with the sample data to justify its capability and validity along with the progress of the research. Any businesscan be chosen for data collection. Sample data can be collected from field study by conducting semi-structured interviews with the customers and through focus group meetings with the dept-in-charges. Once the models capability is proven, large volume of data will be collected from the customers and the organisations by organizing surveys and focus group meetings to test the comprehensive model. Therefore, both qualitative and quantitative methods will be used in this research approach for data collection and analysis. A. Research Elements This research approach uses blended value requirements and sustainability as the main elements. According to our approach, blended value is consists of three values: customer value, business value, and process value. Sustainability of business includes economic value, social value, and environmental value. Therefore, to be competitive in the market the value need to be measured from three dimensions: What total value is demanded by the customers? What total value is required by the businesses based on their strategy to reach their goals? What process value is required by the businesses to have an efficient and sustainable value processes? Consequently, based on the measurement from three dimensions blended value requirements can be categorised into 9 (nine) groups which will be used as the main elements of this approach. They are as follows: 1) Economic value for customer requirements:This means any of the customers value requirements which is somehow economically related directly or indirectly to the product or service that is to be delivered to the customer. In other words, these requirements mean all types of economic benefits that the customers are looking for. For example, price of the product or service, quality, after-sales-service, availability or ease of access, delivery, etc. appear under this category. 2) Social value for customer requirements:Social value requirements for the customer include any value delivered by the businesses for the customers society. These social value requirements are not the social responsibilities that the business organisations are thinking to perform, rather these are the requirements that the customers are expecting or indirectly demanding for their society from the products or services or from the supplier of the products or services. 3) Environmental value for customer requirements:Environmental value requirements stand for all the environmental factors related directly or indirectly, to the product or service delivered to the customer or they can be somehow related to the operations of supplier of the product or service, such as, emissions (air, water, and soil), waste,

radiation, noise, vibration, energy intensity, material intensity, heat, direct intervention on nature and landscape, etc [22]. This environmental value is demanded or expected by the customers. 4) Economic value for business requirements:These requirements are those requirements which add some economic value to the business directly or indirectly if they are fulfilled. These economic requirementsare not demanded by the customers instead they are identified by the businesses to be fulfilled to achieve the planned future goals. For example, reducing the cost of production, increase of sales and/or profit, getting cheaper raw materials, minimizing packaging and delivery cost, replacing the employees with more efficient machinery, etc. 5) Social value for business requirements: Social value requirements are to add some value to the society from businesss point of view if they are fulfilled. These value requirements reflect what social value the business is planning and willing to deliver to the customers society in time regardless of the customers demand. For instance, Lever Bros Ltd. uses few principles to focus on social value, such as, emphasising on employees personal development, training, health, and safety; improving well-being of the society at large, etc. [23]. 6) Environmental value for business requirements: Adding environmental value can be a competitive advantage for the businesses since businesses can differentiate themselves by creating products or processes that offer environmental benefits. By implementing environmental friendly operations businesses may achieve cost reductions, too. For example, reduced contaminations, recycling of materials, improved waste management, minimize packaging, etc., reduce the impact on the environment and the costs. 7) Economic value for process requirements:These are mainly related to the cost savings within the existing value processes which can be later transferred to the customers. The managers identify these value creating inefficiencies within the existing processes and try to correct them which result in some sort of economic benefits. For example, up-to-date technologies, adequate amount of training, using efficient energies, improved supply chain management systems, etc. can increase the efficiency of the value processes that can certainly add some economic value to the organisation. 8) Social value for process requirements: To identify these requirements managers look at the whole value process of the organisation and see whether there is any scope to add some value to the society they are operating within the existing value process systems. For instance, educating disadvantaged children, organising skills training for unemployed people, employing disabled people, establishing schools and colleges, sponsoring social events, organising social gathering, organising awareness programs etc. can add value to the society and most of these requirements can be easily fulfilled by the businesses without or with a little investments or efforts.


Figure 1: Research approach.

9) Environmental value for process requirements:To fulfil these requirements, the businesses try to find and implement all the necessary steps within the existing value processes that will stop or reduce the chances of negative impacts and facilitate positive impacts on the environment, thus, adding some value to the environment. For example, leakage of water/oil/heat, inefficient disposal and recycling of materials, unplanned pollution (air, water, sound) management, heating and lighting inefficiency, etc. within the existing value processes result damages to the environment. Thus these requirements need to be fulfilled to minimize the impact of current value processes on the environment.

B. Research Tools
1) Quality Function Deployment (QFD): QFD supports the product design and development process; it was laid out in the late 1960s to early 1970s in Japan by Akao [24]. QFD is based on collecting and analysing the voice of the customer, which helps to develop products with higher quality that meet customer needs [25]. Therefore, it can also be used to analyse business needs and value process needs. The popular application fields of QFD are product development, quality management and customer needs analysis; however, the utilisation of the QFD method has spread out to other manufacturing fields over time [26]. Recently, companies have been successfully using QFD as a powerful tool that addresses strategic and operational decisions in businesses [27]. This tool is used in various fields for determining customer needs, developing priorities, formulating annual policies, benchmarking, environmental decision making, etc. Chan and Wu [26] and Mehrjerdi [27] provide a long list of areas where QFD has been applied. QFD, in this approach, will be applied as the main tool to analyse customer needs, business needs, and process value needs. It will also be used to develop and select optimised design requirements, based on the organisation's capability, to satisfy the blended value requirements for the sustainability of the businesses. QFD, in this approach, will be applied as the main tool to address customers' requirements (CRs) and integrate those requirements into design requirements (DRs) to meet the sustainability requirements of buyers and stakeholders. In QFD modeling, customer requirements are referred to as WHATs and the ways to fulfil the customers' requirements are referred to as HOWs. The process of using appropriate HOWs to meet the given WHATs is represented as a matrix (Fig. 2). Different users build different QFD models involving different elements, but the most simple and widely used QFD model contains at least the customer requirements (WHATs) and their relative importance, the technical measures or design requirements (HOWs) and their relationships with the WHATs, and the importance ratings of the HOWs. Six sets of input information are required in a basic QFD model: (i) WHATs: attributes of the product as demanded by the customers; (ii) IMPORTANCE: relative importance of the above attributes as perceived by the customers; (iii) HOWs: design attributes of the product or the technical descriptors; (iv) Correlation Matrix: interrelationships among the design requirements; (v) Relationship Matrix: relationships between WHATs and HOWs (strong, medium or weak); and (vi) Competitive Assessment: assessment of customer satisfaction with the attributes of the product under consideration against the product produced by its competitor or the best manufacturer in the market [32]. The following steps are followed in a QFD analysis: Step 1: Customers are identified and their needs are collected as WHATs; Step 2: Relative importance ratings of WHATs are determined; Step 3: Competitors are identified, customer competitive analysis is conducted, and customer performance goals for WHATs are set; Step 4: Design requirements (HOWs) are generated; Step 5: Correlations between design requirements (HOWs) are determined; Step 6: Relationships between WHATs and HOWs are determined; Step 7: Initial technical ratings of HOWs are determined; Step 8: Technical competitive analysis is conducted and technical performance goals for HOWs are set; Step 9: Final technical ratings of HOWs are determined. Lastly, based on the rankings of the weights of the HOWs, the design requirements are selected.

Figure 2: QFD layout.

2) Analytic Hierarchy Process (AHP): Saaty [28] developed the analytic hierarchy process, which is an established multi-criteria decision making approach that employs a unique


method of hierarchical structuring of a problem and subsequent ranking of alternative solutions by a paired comparison technique. The strength of AHP lies in its robust and well-tested method of solution and its capability of incorporating both quantitative and qualitative elements in evaluating alternatives [29]. AHP is a powerful and widely used multi-criteria decision-making technique for prioritizing decision alternatives of interest [30]. AHP is frequently used in the QFD process, for instance by Han et al. [31], Das and Mukherjee [29], Park and Kim [30], Mukherjee [32], Bhattacharya et al. [33], Chan and Wu [34], Xie et al. [35], Wang et al. [36] and more. In this research approach, AHP will be used to prioritize the blended value requirements before developing design requirements in the QFD process, based on customer value requirements, business value requirements, and process value requirements.
3) Delphi Method: The Delphi method has proven a popular tool in information systems (IS) research [37-42]; it was originally developed in the 1950s by Dalkey and his associates at the Rand Corporation [43]. Okoli and Pawlowski [42] and Grisham [43] provide lists of examples of research areas where Delphi was used as the major tool. This research approach will use the Delphi method in designing and selecting optimised design requirements for the company in the QFD process to develop the blended value based e-business model.

VI. RESEARCH PROCESS

Data will be collected from face-to-face interviews and structured focus group meetings. In this stage, blended value requirements (economic value, social value, and environmental value for customer requirements, business requirements and value process requirements) for particular products will be identified based on the existing value proposition, value process and value delivery. Customer requirements will be identified through open-ended semi-structured questionnaires. Business requirements and value process requirements will be identified through focus group meetings with the department in-charges. The required number of questionnaires will be collected from the customers and, based on the feedback from the customers, the necessary data will be collected from structured focus group meetings. The collected data will be analyzed using AHP and QFD. There are a few steps that will be used to complete the data analysis: (i) The blended value requirements will be grouped and categorized into classifications based on the type of requirements. Then they will be prioritized using AHP to find out the importance level of each of the requirements; (ii) The target level for each of the total requirements will be set depending on the importance level of each requirement and the organisation's capability and strategy. After prioritizing, total requirements will be benchmarked, if necessary, to set the target levels of the requirements; (iii) Based on the target levels of each requirement, design requirements will be developed. Design requirements will be developed through the Delphi method after structured discussion or focus group meetings with the related department in-charges. Design requirements will be benchmarked, if necessary, before setting target values for those requirements. Also, costs will be determined for elevating each design requirement; (iv) A relationship matrix between blended value requirements and design requirements will be

developed using QFD to get the weight of each design requirement. Then, based on the weights (how much each design requirement contributes to meeting each of the total requirements), certain design requirements will be selected initially; (v) Then trade-offs among the initially selected design requirements will be identified for cost savings, since improving one design requirement will have a positive, negative, and/or no effect on other design requirements; (vi) Finally, design requirements will be chosen based on the following criteria: initial technical ratings based on the relationship matrix between total requirements and design requirements; technical priorities depending on the organisation's capability; and trade-offs among the design requirements.

VII. RESEARCH ANALYSIS

In the QFD process the relationship between a blended value requirement (BVR) and a design requirement (DR) is described as a Strong, Moderate, Little, or No relationship, which is later replaced by weights (e.g. 9, 3, 1, 0) to give the relationship values needed to make the design requirement importance weight calculations. These weights are used to represent the degree of importance attributed to the relationship. Thus, as shown in Table I, the importance weight of each design requirement can be determined by the following equation:

    D_w = I_1 R_1w + I_2 R_2w + ... + I_m R_mw = Σ_{i=1}^{m} I_i R_iw ,   w = 1, ..., n     (1)

where D_w = importance weight of the wth design requirement; I_i = importance weight of the ith blended value requirement; R_iw = relationship value between the ith blended value requirement and the wth design requirement; n = number of design requirements; m = number of blended value requirements.

In Table I, customer requirements, business requirements, and process requirements are considered as part of the blended value requirements. The importance weight of the blended value requirements will be calculated using AHP after getting data from the customers and businesses, and the importance weight of the design requirements will be decided by the managers through the Delphi method. According to the QFD matrix, the absolute importance of the blended value requirements can be determined by the following equation:

    AI_i = R_i1 D_1 + R_i2 D_2 + ... + R_in D_n = Σ_{w=1}^{n} R_iw D_w ,   i = 1, ..., m     (2)

where AI_i = absolute importance of the ith blended value requirement (BVR_i); D_w = importance weight of the wth design requirement; R_iw = relationship value between the ith blended value requirement and the wth design requirement; the importance weight I_i of the ith blended value requirement is obtained from AHP as described above. Therefore, the absolute importance for the 1st blended value requirement (BVR_1) will be:

    AI_1 = R_11 D_1 + R_12 D_2 + ... + R_1n D_n

TABLE I. QFD MATRIX

                       DR_1       DR_2       ...   DR_n       A.I.   R.I.
  CRs   BVR_1          R_11 D_1   R_12 D_2   ...   R_1n D_n   AI_1   RI_1
        ...            ...        ...        ...   ...        ...    ...
  BRs   BVR_i          R_i1 D_1   R_i2 D_2   ...   R_in D_n   AI_i   RI_i
        ...            ...        ...        ...   ...        ...    ...
  PRs   BVR_m          R_m1 D_1   R_m2 D_2   ...   R_mn D_n   AI_m   RI_m
  A. I.                D_1        D_2        ...   D_n
  R. I.                RI_D1      RI_D2      ...   RI_Dn

Note: A.I. = Absolute importance; R.I. = Relative importance; DR = Design requirements; CR = Customer requirements; BR = Business requirements; PR = Process requirements; BVR = Blended value requirements.

Thus, the relative importance of the 1st blended value requirement (BVR_1) will be:

    RI_1 = AI_1 / (AI_1 + AI_2 + ... + AI_m)     (3)

where RI_1 = relative importance of the 1st blended value requirement (BVR_1) and AI_1 = absolute importance of the 1st blended value requirement (BVR_1). Similarly, the absolute importance and the relative importance of all other blended value requirements can be determined by following Equations (2) and (3). Therefore, the absolute importance of the first design requirement (D_1) will be:

    D_1 = I_1 R_11 + I_2 R_21 + ... + I_m R_m1

In the same way, the relative importance of the 1st design requirement can be determined by the following equation:

    RI_D1 = D_1 / (D_1 + D_2 + ... + D_n)     (4)

where RI_D1 = relative importance of the 1st design requirement and D_1 = absolute importance of the 1st design requirement.
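As a rough illustration of the calculations in Equations (1)-(4), the following Python sketch derives the blended value requirement weights from an AHP pairwise comparison matrix using the geometric-mean approximation and then computes the absolute and relative importances. The sample matrices and all numbers are purely illustrative and do not come from the study.

```python
import numpy as np

def ahp_weights(pairwise):
    """Approximate the AHP priority vector via the geometric mean of each row."""
    gm = np.prod(pairwise, axis=1) ** (1.0 / pairwise.shape[1])
    return gm / gm.sum()

def qfd_importance(I, R):
    """Importance calculations corresponding to Equations (1)-(4).

    I: importance weights of the m blended value requirements (from AHP).
    R: m x n relationship matrix (e.g. entries 9, 3, 1, 0).
    Returns (D, RI_D, AI, RI): absolute/relative importance of the n design
    requirements and of the m blended value requirements.
    """
    D = I @ R             # Eq. (1): D_w = sum_i I_i * R_iw
    RI_D = D / D.sum()    # Eq. (4): relative importance of design requirements
    AI = R @ D            # Eq. (2): AI_i = sum_w R_iw * D_w
    RI = AI / AI.sum()    # Eq. (3): relative importance of value requirements
    return D, RI_D, AI, RI

# illustrative example: 3 blended value requirements, 4 design requirements
pairwise = np.array([[1.0, 3.0, 5.0],
                     [1/3., 1.0, 2.0],
                     [1/5., 1/2., 1.0]])
I = ahp_weights(pairwise)
R = np.array([[9, 3, 0, 1],
              [3, 9, 1, 0],
              [0, 1, 9, 3]], dtype=float)
D, RI_D, AI, RI = qfd_importance(I, R)
```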

The absolute importance and the relative importance of all other design requirements can be determined by following Equations (1) and (4). Once the absolute importance and relative importance of the blended value requirements and design requirements are determined, the cost trade-offs will be identified through the correlation matrix of QFD as mentioned in Section IV. The trade-offs among the selected design requirements are identified based on whether improving one design requirement has a positive, negative, and/or no effect on other design requirements. Finally, after considering the initial technical ratings found from the absolute importance and relative importance of the blended value requirements and design requirements, the organisation's capability, and the cost trade-offs, optimized design requirements will be selected to develop the blended value based sustainable e-business model.

VIII. CONCLUSION AND DISCUSSION

There are a number of ideas and proposals about business modeling and e-business modeling, but there is no clear proposal or idea about sustainable e-business modeling. Similarly, there are only a few thoughts in the literature about blended value or shared value, but all of them considered blended value only from the customers' value requirements point of view. In this approach, all of the value requirements (customer, business, and process) are taken into consideration to develop the model. Therefore, this modeling approach is significant for four reasons. Firstly, a few modeling approaches exist for e-business and for sustainable business separately, but there is no approach available that combines e-business modeling and sustainability. Secondly, blended (economic, social, environmental) value is considered not only from the customer's point of view but also from the business's point of view and the value process's point of view, since the fulfillment of only customers' requirements cannot guarantee long-run sustainability. Thirdly, what was not shown before is how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model. Fourthly, this modeling approach shows the way towards efficient allocation of resources for the businesses by indicating the importance level of the value requirements for sustainability. We have shown how the proposed model needs to be implemented with detailed formulas after providing extensive literature in this field. We have also identified the necessary tools for this approach and explained the whole research process step by step. Our further research will be directed at the implementation of this approach in real-life businesses. There should not be much difficulty in implementing this approach in any real-life business other than accommodating the elements of this approach in different business contexts.

REFERENCES

[1] Al-Debei, M.M. and D. Avison, Developing a unified framework of the business model concept. European Journal of Information Systems, 2010. 19: p. 359-376.
[2] Chan, L.-K. and M.-L. Wu, A systematic approach to quality function deployment with a full illustrative example. Omega: The International Journal of Management Science, 2005. 33(2): p. 119-139.


[3] Zott, C., R. Amit, and L. Massa, The Business Model: Recent Developments and Future Research. Journal of Management, 2011. 37(4): p. 1019-1042.
[4] Alt, R. and H. Zimmerman, Introduction to Special Section - Business Models. Electronic Markets, 2001. 11(1): p. 3-9.
[5] Afua, A. and C. Tucci, eds., Internet Business Models and Strategies. International Editions ed. 2001, McGraw-Hill: New York.
[6] Timmers, P., Business Models for Electronic Markets. Electronic Markets, 1998. 8(2): p. 3-8.
[7] Applegate, L.M., Emerging e-business models: lessons learned from the field. Harvard Business Review, 2001.
[8] Weill, P. and M. Vitale, What IT Infrastructure capabilities are needed to implement e-business models? MIS Quarterly, 2002. 1: p. 17-34.
[9] Rappa, M., Managing the digital enterprise - Business models on the web. 1999 [accessed 4 April 2011]; Available from: http://digitalenterprise.org/models/models.html.
[10] Dubosson-Torbay, M., A. Osterwalder, and Y. Pigneur, E-business model design, classification and measurements. Thunderbird International Business Review, 2001. 44(1): p. 5-23.
[11] Tapscott, D., A. Lowy, and D. Ticoll, Digital capital: Harnessing the power of business webs. Thunderbird International Business Review, 2000. 44(1): p. 5-23.
[12] Gordijn, J. and H. Akkermans, e3 Value: A Conceptual Value Modeling Approach for e-Business Development. In First International Conference on Knowledge Capture, Workshop Knowledge in e-Business. 2001.
[13] Bell, S. and S. Morse, Sustainability Indicators: measuring the immeasurable. 2009, London: Earthscan Publications.
[14] Porter, M.E., The big idea: creating shared value. Harvard Business Review, 2011. 89(1-2).
[15] Akkermans, H., Intelligent e-business: from technology to value. Intelligent Systems, IEEE, 2001. 16(4): p. 8-10.
[16] Houghton, J., ICT and the Environment in Developing Countries: A Review of Opportunities and Developments. In What Kind of Information Society? Governance, Virtuality, Surveillance, Sustainability, Resilience, J. Berleur, M. Hercheui, and L. Hilty, Editors. 2010, Springer Boston. p. 236-247.
[17] Brigden, K., et al., Cutting edge contamination: A study of environmental pollution during the manufacture of electronic products. 2007, Greenpeace International. p. 79-86.
[18] Greenpeace, Guide to Greener Electronics. 2009.
[19] WWF/Gartner, WWF-Gartner assessment of global low-carbon IT leadership. 2008, Gartner Inc.: Stamford CT.
[20] Shrivastava, P., The Role of Corporations in Achieving Ecological Sustainability. The Academy of Management Review, 1995. 20(4): p. 936-960.
[21] Elliot, S., Transdisciplinary perspectives on environmental sustainability: a resource base and framework for IT-enabled business transformation. MIS Quarterly, 2011. 35(1): p. 197-236.
[22] Figge, F., et al., The Sustainability Balanced Scorecard - linking sustainability management to business strategy. Business Strategy and the Environment, 2002. 11(5): p. 269-284.
[23] Zairi, M. and J. Peters, The impact of social responsibility on business performance. Managerial Auditing Journal, 2002. 17(4): p. 174-178.
[24] Akao, Y., Quality Function Deployment (QFD): Integrating customer requirements into Product Design. 1990, Cambridge, MA: Productivity Press.
[25] Delice, E.K. and Z. Güngör, A mixed integer goal programming model for discrete values of design requirements in QFD. International Journal of Production Research, 2010. 49(10): p. 2941-2957.
[26] Chan, L.-K. and M.-L. Wu, Quality function deployment: A literature review. European Journal of Operational Research, 2002. 143(3): p. 463-497.
[27] Mehrjerdi, Y.Z., Applications and extensions of quality function deployment. Assembly Automation, 2010. 30(4): p. 388-403.
[28] Saaty, T.L., AHP: The Analytic Hierarchy Process. 1980, New York: McGraw-Hill.

[29] Das, D. and K. Mukherjee, Development of an AHP-QFD framework for designing a tourism product. International Journal of Services and Operations Management, 2008. 4(3): p. 321-344.
[30] Park, T. and K.-J. Kim, Determination of an optimal set of design requirements using house of quality. Journal of Operations Management, 1998. 16(5): p. 569-581.
[31] Han, S.B., et al., A conceptual QFD planning model. The International Journal of Quality & Reliability Management, 2001. 18(8): p. 796.
[32] Mukherjee, K., House of sustainability (HOS): an innovative approach to achieve sustainability in the Indian coal sector. In Handbook of corporate sustainability: frameworks, strategies and tools, M.A. Quaddus and M.A.B. Siddique, Editors. 2011, Edward Elgar: Massachusetts, USA. p. 57-76.
[33] Bhattacharya, A., B. Sarkar, and S.K. Mukherjee, Integrating AHP with QFD for robot selection under requirement perspective. International Journal of Production Research, 2005. 43(17): p. 3671-3685.
[34] Chan, L.K. and M.L. Wu, Prioritizing the technical measures in Quality Function Deployment. Quality Engineering, 1998. 10(3): p. 467-479.
[35] Xie, M., T.N. Goh, and H. Wang, A study of the sensitivity of customer voice in QFD analysis. International Journal of Industrial Engineering, 1998. 5(4): p. 301-307.
[36] Wang, H., M. Xie, and T.N. Goh, A comparative study of the prioritization matrix method and the analytic hierarchy process technique in quality function deployment. Total Quality Management, 1998. 9(6): p. 421-430.
[37] Brancheau, J.C., B.D. Janz, and J.C. Wetherbe, Key Issues in Information Systems Management: 1994-95 SIM Delphi Results. MIS Quarterly, 1996. 20(2): p. 225-242.
[38] Hayne, S.C. and C.E. Pollard, A comparative analysis of critical issues facing Canadian information systems personnel: a national and global perspective. Information & Management, 2000. 38(2): p. 73-86.
[39] Holsapple, C.W. and K.D. Joshi, Knowledge manipulation activities: results of a Delphi study. Information & Management, 2002. 39(6): p. 477-490.
[40] Lai, V.S. and W. Chung, Managing international data communications. Commun. ACM, 2002. 45(3): p. 89-93.
[41] Paul, M., Specification of a capability-based IT classification framework. Information & Management, 2002. 39(8): p. 647-658.
[42] Okoli, C. and S.D. Pawlowski, The Delphi method as a research tool: an example, design considerations and applications. Information & Management, 2004. 42(1): p. 15-29.
[43] Grisham, T., The Delphi technique: a method for testing complex and multifaceted topics. International Journal of Managing Projects in Business, 2009. 2(1): p. 112-130.


Protein Structure Prediction in 2D Triangular Lattice Model using Differential Evolution Algorithm
Aditya Narayan Hati
IT Department NIT Durgapur Durgapur, India
Email:smartyadi88@gmail.com

Nanda Dulal Jana


IT Department NIT Durgapur Durgapur, India
Email:nanda.jana@gmail.com

Sayantan Mandal
IT Department NIT Durgapur Durgapur, India
Email: pikusayantan@gmail.com

Jaya Sil
Dept. of CS and Tech. Bengal Eng. and Sc. University West Bengal, India
Email: js@cs.becs.ac.in

Abstract: Protein structure prediction from the primary structure of a protein is a very complex and hard problem in computational biology. Here we propose a differential evolution (DE) algorithm on the 2D triangular hydrophobic-polar (HP) lattice model for predicting the structure of a protein from its primary sequence. We propose an efficient and simple backtracking algorithm to avoid overlapping of the given sequence. This methodology is experimented on several benchmark sequences and compared with other similar implementations. We see that the proposed DE performs better and more consistently than the previous ones.

Keywords: 2D Triangular lattice model; Hydrophobic-polar model; Evolutionary computation; Differential Evolution; protein; backtracking and protein structure prediction.

Fig 1: Adding an auxiliary axis along the diagonal of a square lattice.
Fig 2: Skewing the square lattice into a 2D triangular lattice.
Fig 3: The 2D triangular lattice model neighbors of vertex (x, y).

I. INTRODUCTION

Protein plays a key role in all biological processes. A protein is a long sequence of 20 basic amino acids [2]. The exact way proteins fold just after being synthesized in the ribosome is unknown. As a consequence, the prediction of protein structure from its amino acid sequence is one of the most prominent problems in bioinformatics. There are several experimental methods for protein structure prediction such as MRI (magnetic resonance imaging) and X-ray crystallography, but these methods are expensive in terms of equipment, computation and time. Therefore computational approaches to protein structure prediction are pursued. The HP lattice model [1] is the simplest and most widely used model. In this paper, the 2D triangular lattice model is used for protein structure prediction because this model resolves the parity problem of the 2D HP lattice model and gives better structures. From a computational point of view, protein structure prediction in the 2D HP model is NP-complete [6]. It can be transformed into an optimization problem. Recently, several methods have been proposed to solve the protein structure prediction problem, but there is no efficient method so far as it is an NP-hard problem. Here we introduce a Differential Evolution algorithm with a simple backtracking correction to make the sequence self-avoiding. The objective of this work is to evaluate the applicability of DE to PSP using the 2D triangular HP model and to compare its performance with other contemporary methods.

II. 2D TRIANGULAR LATTICE MODEL

HP model is the most widely used models. It was introduced by Dill et al in 1987 [1]. Here 20 basic amino acids are classified into two categories (I) hydrophobic and (II) polar according to the affinity towards water. When a peptide bond occurs between two amino acids, those two amino acids are said to be consecutive otherwise those are non-consecutive. When two non-consecutive amino acids are placed side by side in lattice, we say that they are in topological contact. We have to design the model in such a way that the sequences must be self-avoiding and the nonconsecutive H amino acids make a hydrophobic core. But these HP lattice model possess a flaw referred as the parity problem. The problem is that when two residues of even distance from one another in the sequence are unable to be placed in such a way that they are in topological contact. In order to solve this parity problem, the 2D triangular HP lattice model is introduced [10]. In triangular lattice model, let and be two primary axes of square lattice. Take an auxiliary axis = + along the diagonal (Fig 1) and skew it until the angle between, becomes 120 (Fig 2). By this way, we obtain 2D HP triangular lattice model. For example (as in Fig 3), the lattice point P=(x, y) has six neighbors (x+1, y) as R, (x-1, y) as L, (x, y+1) as LU, (x, y-1) as RD, (x+1, y+1) as RU and (x-1, y-1) as LD.
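The neighbourhood structure just described translates directly into code. The following Python sketch is our own illustration (not code from the paper); it maps each of the six relative moves to its coordinate offset and decodes a move sequence into lattice points:

MOVES = {
    # Offsets on the 2D triangular lattice, as defined in Section II.
    "R": (1, 0), "L": (-1, 0),
    "LU": (0, 1), "RD": (0, -1),
    "RU": (1, 1), "LD": (-1, -1),
}

def neighbours(point):
    """Return the six triangular-lattice neighbours of a point (x, y)."""
    x, y = point
    return [(x + dx, y + dy) for dx, dy in MOVES.values()]

def decode(moves, start=(0, 0)):
    """Turn a list of relative moves, e.g. ['R', 'RU', 'LU'], into lattice coordinates."""
    path = [start]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = path[-1]
        path.append((x + dx, y + dy))
    return path

A conformation of an n-residue protein is then simply a start point plus (n-1) such moves, which is the encoding used by the DE algorithm in Section IV.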

116

The Eighth International Conference on Computing and Information Technology

IC2IT 2012

III. PREVIOUS WORK

In the 2D HP lattice model, a lot of work has been done using evolutionary algorithms, such as the simple genetic algorithm and its variations [9] and a differential evolution algorithm with a vector encoding scheme for initialization [11], [12]. But the parity problem in this lattice model is a severe bottleneck. Therefore the 2D triangular lattice model is considered [13]. Recently, SGA, HGA and ERS-GA have been applied in this model [10]. Tabu search, particle swarm optimization and hybrid algorithms (hybrids of GA and PSO) have also been applied in this model.

IV. METHODOLOGY

In this section, the strategies proposed to improve the performance of the DE algorithm, applied to protein structure prediction using the 2D triangular lattice model, are described.

A. Differential Evolution Algorithm
The differential evolution algorithm was introduced by Storn and Price [3], [4]. It is an evolutionary algorithm used for optimization problems, and it is particularly useful when the gradient of the objective is difficult or even impossible to derive. Consider a fitness (objective) function f : R^n → R. To minimize the function f, find a ∈ R^n such that ∀ b ∈ R^n : f(a) ≤ f(b). Then a is called a global minimum of the function f. It is usually not possible to pinpoint the global minimum exactly, so candidate solutions with sufficiently good fitness values are acceptable in practice. There are several variants of DE proposed by Storn. We consider DE/rand/1/bin and DE/best-to-rand/2/exp in this problem. At first the first strategy is taken; when stagnation occurs for 100 generations the second strategy is taken, and when stagnation occurs again the first strategy is taken, and so on. The algorithm (Fig 4) is described below.

After initialization of the population, the algorithm applies a number of operations and calculates the fitness values of the individuals. First, mutation is performed; two mutation strategies are used. At the start the first strategy is taken. If stagnation occurs for more than 100 generations, the second strategy is chosen; if stagnation occurs again in the next 100 generations, the first one is chosen again, and so on. When the first strategy is chosen, binomial crossover is applied; when the second strategy is chosen, exponential crossover is applied. After that a repair function is called to convert infeasible solutions into feasible ones. Then the selection procedure is performed based on a greedy strategy. The initialization, mutation, crossover and selection operations are described in the following subsections.

1) Initialization: In DE, for each individual component the upper bound b_U and lower bound b_L are stored in a 2 x D matrix, called the initialization matrix, where D is the dimension of each individual. The vector components are created in the following way:

x_{j,i,0} = rand(0,1) · (b_{j,U} − b_{j,L}) + b_{j,L}, where 0 ≤ rand(0,1) ≤ 1.

There are basically three types of coordinates to represent the amino acids in the lattice: Cartesian coordinates, internal coordinates and relative coordinates. The proposed DE uses relative coordinates. Based on this model, there are six possible movements {L, R, LU, LD, RU, RD} from a point P(x, y), defined as follows: (x, y+1) as LU, (x-1, y) as L, (x-1, y-1) as LD, (x, y-1) as RD, (x+1, y) as R and (x+1, y+1) as RU. If the number of amino acids in the given sequence is n, then the total number of moves in the amino acid sequence is (n-1). For each target vector we randomly choose (n-1) moves from 1 to 6. In this way we initialize the whole population matrix and calculate the energy of each target vector using the fitness function. If a target vector is infeasible we set its fitness value to 1. The population size np is a parameter of DE; we take np as 5 times the dimension of the target vector.

2) Mutation: Mutation is a kind of exploration technique which can explore very rapidly. It creates np donor (trial) vectors. The mutation process of the first strategy is as follows:

V_{i,g} = X_{r0} + F · (X_{r1} − X_{r2}), where r0 ≠ r1 ≠ r2 ≠ i.

The mutation process of the second strategy is as follows:

V_{i,g} = X_{r0} + F · (X_{best} − X_{r1}) + F · (X_{best} − X_{r2}), where r0 ≠ r1 ≠ r2 ≠ i.

Here X_{best} is the best target vector in the current generation. F ∈ (0,1) is a parameter called the weighting factor; we have taken F = 0.25 · (1 + rand), where rand is a uniformly distributed random number.
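A minimal Python sketch of the two mutation strategies above (our own illustration, not the authors' code; names are ours, and since the vector components encode the six relative moves, in practice the mutant values would additionally be mapped back to the range 1 to 6):

import random

def weighting_factor():
    """F = 0.25 * (1 + rand), as used in the paper."""
    return 0.25 * (1 + random.random())

def mutate_rand_1(pop, i, F):
    """DE/rand/1: V = X_r0 + F*(X_r1 - X_r2), with r0, r1, r2 and i all distinct."""
    r0, r1, r2 = random.sample([k for k in range(len(pop)) if k != i], 3)
    return [x0 + F * (x1 - x2)
            for x0, x1, x2 in zip(pop[r0], pop[r1], pop[r2])]

def mutate_best_to_rand_2(pop, i, best, F):
    """Second strategy: V = X_r0 + F*(X_best - X_r1) + F*(X_best - X_r2)."""
    r0, r1, r2 = random.sample([k for k in range(len(pop)) if k != i], 3)
    return [x0 + F * (b - x1) + F * (b - x2)
            for x0, x1, x2, b in zip(pop[r0], pop[r1], pop[r2], best)]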


3) Crossover: Crossover is also a kind of exploration technique, which explores in a restricted manner. In DE there are two basic crossover techniques: (i) binomial crossover and (ii) exponential crossover. As we use DE/rand/1/bin and DE/best-to-rand/2/exp, we consider both binomial and exponential crossover. Binomial crossover is defined as follows:

U_{j,i,g} = V_{j,i,g} if (rand(0,1) ≤ Cr or j = jrand), and U_{j,i,g} = X_{j,i,g} otherwise.

The exponential crossover is defined as follows:

U_{j,i,g} = V_{j,i,g} for j = <n>_D, <n+1>_D, ..., <n+L-1>_D, and U_{j,i,g} = X_{j,i,g} for all other j in [1, D].

Cr ∈ (0,1) is a parameter called the crossover probability; we have taken Cr = 0.8. In the exponential crossover <>_D denotes the modulo-D operator. jrand is a random index chosen from 1 to D, where D is the dimension of the target vector. Here U_{j,i} is the trial vector, V_{j,i} is the donor vector and X_{j,i} is the target vector. The pseudo codes of the exponential and binomial crossover are given below.

jr = floor(rand(0,1)*D);             // 0 <= jr < D
j = jr;
do {
    U[j,i] = V[j,i];                 // child inherits a mutant parameter
    j = (j+1) % D;                   // increment j modulo D
} while (rand(0,1) < Cr && j != jr); // take another mutant parameter?
while (j != jr) {                    // take the rest, if any, from the target
    U[j,i] = X[j,i];
    j = (j+1) % D;
}
Fig 5: Pseudo code of exponential crossover.

jr = floor(rand(0,1)*D);             // 0 <= jr < D
for j = 1 to D {
    if (rand(0,1) < Cr or j == jr) {
        U[j,i] = V[j,i];
    } else {
        U[j,i] = X[j,i];
    }
}
Fig 6: Pseudo code of binomial crossover.

4) Selection: Selection is an exploitation technique which drives the search towards the global minimum. After mutation and crossover we apply a repair function to repair the infeasible solutions. Then we calculate the fitness function from the coordinates. The fitness function is the free energy of the sequence calculated from the model; a lower energy implies a more stable molecule. Here we consider the hydrophobic-polar model. The free energy calculation procedure and the repair process are described later. Selection is a tournament procedure based on the value of the fitness function:

X_{i,g+1} = U_{i,g} if f(U_{i,g}) ≤ f(X_{i,g}), and X_{i,g+1} = X_{i,g} otherwise.

B. Repair function
After applying the mutation and crossover operations, a target vector may become infeasible, i.e. the sequence may become non-self-avoiding. There are three ways to handle this problem: discarding the infeasible solution, using a penalty function, or using a repair function based on backtracking. We propose here the repair function using backtracking, illustrated in Fig 7. The random movements are stored in 'S'. Each node has a value 'back' which stores the number of invalid directions; whenever the back value becomes greater than 5 it causes a backtrack. A pointer 'i' keeps track of the current working node. Every time a backtrack occurs the pointer 'i' is decreased and the back value of that node is reset to 0. When a node is placed, its back value is increased by 1. If a particular direction is not available, a fixed strategy is followed to place the amino acid in a new direction: if right is not available try down, if down is not available try left, if left is not available try up, and if up is not available try right. The number of attempts to place a particular amino acid with respect to a particular coordinate is 4. When the value of 'i' becomes equal to the length of the amino acid sequence, the whole folding has been repaired.

Fig 7: Repair function using backtracking
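The backtracking idea of Fig 7 can be pictured with the simplified Python sketch below. It is only our illustration of the general technique (the exact retry order and data structures used by the authors may differ): it decodes the move sequence into lattice points and, when a point collides with an earlier one, tries the remaining directions for that residue, backtracking to the previous residue once all six directions are exhausted.

MOVES = {1: (1, 0), 2: (-1, 0), 3: (0, 1), 4: (0, -1), 5: (1, 1), 6: (-1, -1)}

def repair(moves):
    """Make a sequence of (n-1) moves (integers 1..6) self-avoiding by backtracking.
    Returns a repaired copy of the move list, or None if no repair exists."""
    moves = list(moves)
    n = len(moves)
    path = [(0, 0)]                     # coordinates placed so far
    occupied = {(0, 0)}
    tried = [set() for _ in range(n)]   # directions already tried at each position
    i = 0
    while i < n:
        # Prefer the original move, then the remaining directions.
        candidates = [moves[i]] + [d for d in MOVES if d != moves[i]]
        placed = False
        for d in candidates:
            if d in tried[i]:
                continue
            tried[i].add(d)
            x, y = path[-1]
            dx, dy = MOVES[d]
            nxt = (x + dx, y + dy)
            if nxt not in occupied:     # self-avoidance check
                moves[i] = d
                path.append(nxt)
                occupied.add(nxt)
                placed = True
                break
        if placed:
            i += 1
        else:                           # all six directions failed: backtrack
            if i == 0:
                return None
            tried[i] = set()
            i -= 1
            occupied.discard(path.pop())
    return moves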


C. The free energy calculation procedure
The free energy of an amino acid sequence is determined by the topological contacts between non-consecutive H residues. The free energy of a protein can be calculated by the following formula:

E = Σ_{i,j} ε_{ij} · r_{ij}

where ε_{ij} = -1.0 if residues i and j are a pair of H and H residues, and 0.0 otherwise; and r_{ij} = 1 if S_i and S_j are adjacent in the lattice but not connected amino acids, and 0 otherwise.
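Under this model the fitness of a conformation can be computed as in the following sketch (our own illustration, assuming the stated value ε_HH = -1 and the six-neighbour contacts of Section II; it also assumes the conformation has already been repaired so that the path is self-avoiding):

NEIGHBOUR_OFFSETS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1)]

def free_energy(sequence, path):
    """sequence: string over {'H','P'}; path: list of lattice points, one per residue.
    Each topological contact between non-consecutive H residues contributes -1."""
    assert len(sequence) == len(path)
    position = {p: i for i, p in enumerate(path)}   # assumes path is self-avoiding
    energy = 0
    for i, p in enumerate(path):
        if sequence[i] != 'H':
            continue
        for dx, dy in NEIGHBOUR_OFFSETS:
            j = position.get((p[0] + dx, p[1] + dy))
            # Count each pair once (j > i) and skip chain neighbours (j == i + 1).
            if j is not None and j > i + 1 and sequence[j] == 'H':
                energy -= 1
    return energy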

V. RESULTS AND COMPARISON

For the experiments, benchmark sequences of 8 synthetic proteins have been chosen from the 2D HP lattice model literature [10]. The minimum energy of these sequences is still unknown in the 2D triangular lattice model. In this model, the Simple Genetic Algorithm (SGA), the Hybrid Genetic Algorithm (HGA) and the Elite-based Reproduction Strategy with Genetic Algorithm (ERS-GA) have been proposed earlier. Comparing our results with the results of these algorithms, it is seen that the DE scheme outperforms the previous algorithms on some sequences and works more consistently in the 2D triangular lattice model. For these experiments, a machine with a Pentium Core 2 Duo (1.6 GHz) processor and 2 GB RAM running Linux was used, with Octave as the testing platform. We ran the algorithm 20 times on each benchmark sequence and compared with the available results. Table I shows the benchmark sequences on which we apply our algorithm. Table II shows the comparison between the minimum energies found by SGA, HGA, ERS-GA and DE. Table III shows the comparison between the minimum energies (avg/best) of ERS-GA and DE. From Table II it can be observed that the results obtained by DE are better than those of the other algorithms for the second and eighth benchmark sequences; the best result for each sequence is the lowest energy value. For the first, third and fourth sequences both ERS-GA and DE work equally well. For the rest of the benchmark sequences, ERS-GA gives better results. However, from Table III it can be concluded that DE works most consistently on these benchmark sequences, as it gives better average results for 6 of the 8 benchmark sequences.

TABLE I. LIST OF 8 BENCHMARK SEQUENCES
Sequence  Length  Amino acid sequence
1   20  (HP)2PH(HP)2(PH)2HP(PH)2
2   24  H2P2(HP2)6H2
3   25  P2HP2(H2P4)3H2
4   36  P(P2H2)2P5H5(H2P2)2P2H(HP2)2
5   48  P2H(P2H2)2P5H10P6(H2P2)2HP2H5
6   50  H2(PH)3PH4PH(P3H)2P4(HP3)2HPH4(PH)3PH2
7   60  P(PH3)2H5P3H10PHP3H12P4H6PH2PHP
8   64  H12(PH)2((P2H2)2P2H)3(PH)2H11

TABLE II. THE BEST FREE ENERGY OBTAINED BY EACH ALGORITHM
Sequence  SGA   HGA   ERS-GA  DE
1         -11   -15   -15     -15
2         -10   -13   -13     -15
3         -10   -10   -12     -12
4         -16   -19   -20     -20
5         -26   -32   -32     -28
6         -21   -23   -30     -27
7         -40   -46   -55     -49
8         -33   -46   -47     -50

TABLE III. COMPARISON OF THE RESULTS (AVG/BEST) BETWEEN ERS-GA AND DE
Sequence  ERS-GA (avg/best)  DE (avg/best)
1         -12.5/-15          -14.8/-15
2         -10.2/-13          -13.4/-15
3         -8.47/-12          -11.2/-12
4         -16/-20            -19.4/-20
5         -28.13/-32         -27/-28
6         -25.3/-30          -26/-27
7         -49.43/-55         -47.2/-49
8         -42.37/-47         -48.2/-50
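The benchmark sequences in Table I are written in a compact notation in which a symbol or parenthesised group followed by a number is repeated, e.g. (HP)2 = HPHP and H2 = HH. A small helper such as the one below (our own utility, not part of the paper) expands this notation into a plain H/P string that can be fed to the energy and repair routines:

def expand(pattern):
    """Expand run-length HP notation, e.g. '(HP)2PH' -> 'HPHPPH', 'H2P2' -> 'HHPP'."""
    def read_number(s, i):
        j = i
        while j < len(s) and s[j].isdigit():
            j += 1
        return (int(s[i:j]) if j > i else 1), j

    def parse(s, i):
        # Parse until ')' or end of string; return (expanded text, next index).
        result = ""
        while i < len(s) and s[i] != ')':
            if s[i] == '(':
                chunk, i = parse(s, i + 1)
                i += 1                      # skip the closing ')'
            else:
                chunk, i = s[i], i + 1
            count, i = read_number(s, i)
            result += chunk * count
        return result, i

    expanded, _ = parse(pattern, 0)
    return expanded

# Example: the first benchmark sequence of Table I expands to a string of length 20.
# expand("(HP)2PH(HP)2(PH)2HP(PH)2") == "HPHPPHHPHPPHPHHPPHPH"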

VI. FUTURE WORK AND CONCLUSION

There has been relatively little exploration of evolutionary strategies for protein structure prediction in the 2D triangular lattice. In this paper, a Differential Evolution algorithm is implemented for protein structure prediction, and a new type of encoding scheme is also proposed. Invalid conformations are repaired using a backtracking method to produce valid conformations. Our experimental results show that the approach is very promising and compares well with other evolutionary algorithms. In the future, better results may be found by upgrading this DE strategy; there is a lot of scope for improvement in the initialization, mutation, crossover and selection operations. Such improvements can lead to better sub-optimal solutions for this problem.

REFERENCES
[1] K. Lau and K. A. Dill, "A lattice statistical mechanics model of the conformation and sequence spaces of proteins," Macromolecules, vol. 22, pp. 3986-3997, 1989.
[2] C. J. Epstein, R. F. Goldberger and C. B. Anfinsen, "The genetic control of tertiary protein structure: Studies with model systems," in Cold Spring Harbor Symposium on Quantitative Biology, vol. 28, pp. 439-449, 1963.
[3] R. Storn and K. Price, "Differential Evolution - A Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces," ICSI Technical Report, ftp.ICSI.Berkeley.edu/pub/techreports/1995/tr-95-012.ps.Z.
[4] R. Storn, "On the usage of differential evolution for function optimization," in Biennial Conference of the North American Fuzzy Information Processing Society (NAFIPS), IEEE, Berkeley, 1996, pp. 519-523.
[5] R. Agarwala, S. Batzoglou, V. Dancik, S. Decatur, M. Farach, S. Hannenhalli, S. Muthukrishnan and S. Skiena, "Local rules for protein folding on a triangular lattice and generalized hydrophobicity in the HP model," Journal of Computational Biology, 4(2):275-296, 1997.


[6] B. Berger and T. Leighton, "Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete," Journal of Computational Biology, 5(1):27-40, 1998.
[7] C. Huang, X. Yang and Z. He, "Protein folding simulations of 2D HP model by the genetic algorithm based on optimal secondary structures," Computational Biology and Chemistry, 34:137-142, 2010.
[8] Joel G., Martin M. and Minghui J., "RNA folding on the 3D triangular lattice," BMC Bioinformatics, 10:369, 2009.
[9] M. T. Hoque, M. Chetty and L. S. Dooley, "A hybrid genetic algorithm for 2D FCC hydrophobic-hydrophilic lattice model to predict protein folding," Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 4304, pp. 867-876, 2006.
[10] S.-C. Su, C.-J. Lin and C.-K. Ting, "An effective hybrid of hill climbing and genetic algorithm for 2D triangular protein structure prediction," International Workshop on Computational Proteomics, Hong Kong, China, 18-21 December 2010.
[11] H. S. Lopes and R. Bitello, "A Differential Evolution Approach for protein folding using a lattice model," Journal of Computer Science and Technology, 22(6):904-908, Nov. 2007.
[12] N. D. Jana and J. Sil, "Protein Structure Prediction in 2D HP lattice model using differential evolutionary algorithms," in S. C. Satapathy et al. (Eds.), Proc. of InConINDIA 2012, AISC 132, pp. 281-290, 2012.
[13] W. E. Hart and A. Newman, "Protein Structure Prediction with Lattice Models," CRC Press, 2001.


Elimination of Materializations from Left/Right Deep Data Integration Plans


Janusz R. Getta
School of Computer Science and Software Engineering
University of Wollongong, Wollongong, NSW, Australia
Email: jrg@uow.edu.au

Abstract—Performance of distributed data processing is one of the central problems in the development of modern information systems. This work considers a model of a distributed system where a user application running at a central site submits data processing requests to the remote sites. The results of processing at the remote sites are transmitted back to the central site and are simultaneously integrated into the final outcomes of the application. Due to external factors like network congestion or high processing load at the remote sites, transmission of data to the central site can be delayed or completely stopped. Then it is necessary to dynamically change the original data integration plan to a new one, which allows for more efficient data integration in the changed environment. This work uses a technique of elimination of materializations from data integration plans to create alternative data integration plans. We propose an algorithm which finds all possible data integration plans for a given sequence of data transmitted to a central site. We show how data integration plans can be dynamically changed in reply to the dynamically changing frequencies of data transmission.

I. INTRODUCTION

Distributed data processing faces an ever increasing demand for more efficient processing of user applications accessing data at numerous different locations and integrating the partial results at a central site. A distributed system based on a global view of data processes the information resources available at the remote sites through the applications running at a central site. A typical user application submits a data processing request to a global view of data, which integrates the data resources available at the remote sites. The request is automatically decomposed into elementary requests, which later on are submitted for processing at the remote sites. The results of processing at the remote sites are transmitted back to a central site and integrated with the data already available there. Data integration is performed according to a data integration plan, which is prepared when a request issued by a user application is decomposed into the individual requests, each one related to a different remote site. A data integration plan determines an order in which the individual requests are submitted for processing at the remote sites and the way the partial results are combined into the final results. Due to factors beyond the control of a central system the transmissions of partial results can be delayed or even completely stopped.

Then, the current data integration plan must be dynamically adjusted to the changing conditions. This work investigates when and how the current data integration plan must be changed in a reply to the increasing/decreasing intensity of transmission of data from the remote sites. The individual requests obtained from the decomposition of a global request are submitted for processing at the remote sites accordingly to entirely sequential or entirely parallel, or hybrid, i.e. mixed sequential and parallel strategies. Accordingly to an entirely sequential strategy a request qi can be submitted for processing at a remote site only when all results of the requests q1 , . . . , qi1 are available at a central site. An entirely sequential strategy is appropriate when the results received so far can be used to reduce the complexity of the remaining requests qi , . . . , qi+k . Accordingly to an entirely parallel strategy all requests q1 , . . . , qi , . . . , qi+k are submitted simultaneously for the parallel processing at the remote sites. An entirely parallel strategy is benecial when the computational complexity and the amounts of data transmitted is more or less the same for all requests. Accordingly to a mixed sequential and parallel strategy some requests are submitted sequentially while the others in parallel. Optimization of data integration plans is either static when the plans are optimized before a stage of data integration or it is dynamic when the plans are changed during the processing of the requests. A static optimization of data integration plans is more appropriate for parallel strategy than for sequential strategy because the plans cannot be changed after the submission to the remote sites. A dynamic optimization of data integration plans allows for the modication of the individual requests and change of their order during the processing of an entire request. This work considers a dynamic optimization of data integration plans for the entirely parallel processing strategy of the individual requests. The problem of dynamic optimization of data integration plans in the entirely parallel processing model can be formulated in the following way. Given a global information system that integrates a number of remote and independent sources of data. A user request q is decomposed into the elementary requests q1 , . . . , qn such that q = E(q1 , . . . , qn ). The requests q1 , . . . , qn are simultaneously submitted for the processing at the remote sites. Let r1 , . . . , rn be the individual


results obtained from the processing of the respective requests q1 , . . . , qn . Then, the nal result of a request q can be obtained from the evaluation of an expression E(r1 , . . . , rn ). If evaluation of an expression E(r1 , . . . , rn ) can be performed in many different ways, for example by changing of an order of operations, then it means that integration of the individual results r1 , . . . , rn can also be performed in many different ways. If some of the individual results are not available due to a network congestion then a way how E(r1 , . . . , rn ) is evaluated can be changed to avoid a deadlock. A problem investigated in this paper is how to dynamically adjust evaluation of E(r1 , . . . , rn ) in a response to the changing parameters of data transmission. One of the specic approaches to data integration is online data integration. In online data integration we consider an individual reply ri as a sequence of data packets ri1 , ri2 , . . . , rik1 , rik and we perform re-computation of E(r1 , . . . , rn ) each time a new packet of data is received at a central site. Such approach to data integration is more efcient because there is no need to wait for the complete results to start evaluation of E(r1 , . . . , rn ). Instead, each time a new packet of data is received at a central site then it is immediately integrated into the intermediate result no matter which remote site it comes from. To perform an online data integration, an expression E(r1 , . . . , rn ) must be transformed into a collection of the sequences of elementary operations called as data integration plans, pr1 , . . . , prn . Each one of the data integration plans pri determines how E(r1 , . . . , rn ) is recomputed for the sequences of packets ri1 , ri2 , . . . , rik1 , rik where i = 1, . . . , n. If an expression E(r1 , . . . , rn ) can be computed in may different ways then it is possible to nd many online data integration plans. Dynamic optimization of online data integration plans nds the best processing plan for the sequences of packets of data obtained in the latest period of time. If the frequences of transmission of individual results r1 , . . . , rn change in time then dynamic optimization nds a data integration plan, which is the best for the most recent frequencies of data transmission. A starting point for the dynamic optimization is a data integration expression E(r1 , . . . , rn ). Next, a data integration expression is transformed into a set of data integration plans where each plan represents an integration procedure for the increments of one argument of E(r1 , . . . , rn ) Some of the plans assume that temporary results of the processing must be stored in so called materializations while the other plans allow for processing of the same data integration expression without the materializations. A data integration system stores all plans and starts data integration accordingly to a plan with the largest possible number of materializations. Then, whenever frequency of data transmission of a given individual result grows beyond a given threshold then dynamic optimizer nds a better data integration plan and changes data integration accordingly to a new plan. The paper is organized in the following way. Section II overviews the related works in an area of optimization of data integration in distributed systems based on a global data

model. Next, Section III shows how online data integration plans can be transformed into data integration plans that include the largest possible number of materializations. In Section IV we show when and how materializations can be eliminated from left/right deep data integration plans and when and how to dynamically change the current data integration plan. Finally, Section VI concludes the paper.

II. PREVIOUS WORKS

The early works [1], [2] on optimization of query processing in distributed database systems, multidatabase, and federated database systems are a starting point of research on efcient processing of data integration. Reactive query processing starts from a pre-optimized plan and whenever the external factors like network problems or unexpected congestion at a local site or unavailability of data make the current plan ineffective then further processing is continued accordingly to an updated plan. The early works on the reactive query processing techniques were either based on partitioning [3], [4] or ondynamic modication of query processing plans [5], [6], [7]. If the further computations are no longer possible then partitioning decomposes a query execution plan into a number of sub-plans and it attempts to continue processing accordingly to the sub-plans. Dynamic modication of query processing plans nds a new plan equivalent to the original one and such that it is possible to continue integration of the available data. The techniques of query scrambling [8], [9], dynamic scheduling of operators [10], and Eddies [11], dynamically change an order in which the join operations are executed depending on the join arguments available on hand. As data integration requires efcient processing of sequences of data items an important research directions were the improvements to pipelined implementation of join operation. These works include new versions of pipelined join operation such as pipelined join operator XJoin [12], ripple join [13], double pipelined join [14], and hash-merge join [15]. A technique of redundant computations simultaneously integrates data accordingly to a number of plans [16]. A concept of state modules described in [17] allows for concurrent processing of the tuples through the dynamic division of data integration tasks. An adaptive and online processing of data integration plans proposed in [18] and later on in [19] considers the sets of elementary operations for data integration and the best integration plan for recently transmitted data. The recent work [20] considers an integration model where the packets of data coming from the external sites are simultaneously integrated into the nal result. Another work [21] describes a system of data integration where the initial and simultaneous data integration plans are automatically transformed into hybrid plans. where some tasks are processed sequentially while the others are processed simultaneously. This work concentrates on simultaneous processing of a specic class of data integration plans whose syntax trees are only left/righ deep and involve the operations of join and


antijoin from the relational algebra. We show when and how data integration plans must be dynamically changed due to the changing frequencies of data transmission. Reviews of the most important data integration techniques proposed so far are included in [22], [23], [24], [25], [26].

III. DATA INTEGRATION EXPRESSIONS


This work applies a relational model of data to formally represent data containers at the remote systems. Let x be a nonempty set of attribute names, later on called a schema, and let dom(a) denote the domain of an attribute a ∈ x. A tuple t defined over a schema x is a full mapping t : x → ∪_{a∈x} dom(a) such that ∀a ∈ x, t(a) ∈ dom(a). A relational table created on a schema x is a set of tuples over the schema x. Let r, s be relational tables such that schema(r) = x and schema(s) = y respectively, and let z ⊆ x, v ⊆ (x ∩ y), v ≠ ∅. The symbols σ, π_z, ⋈_v, ▷_v, ⋉_v, ∪, ∩, − denote the relational algebra operations of selection, projection, join, antijoin and semijoin, and the set algebra operations of union, intersection and difference. All join operations are considered to be equijoin operations over a set of attributes v. A modification of a relational table r is denoted by δ and it is defined as a pair of disjoint relational tables δ = <δ⁻, δ⁺> such that r ∩ δ⁻ = δ⁻ and r ∩ δ⁺ = ∅. A data integration operation that applies a modification δ to a relational table r is denoted by r ⊕ δ and it is defined as the expression (r − δ⁻) ∪ δ⁺. Let E(r_1, ..., r_i, ..., r_n) be a data integration expression. In order to perform data integration simultaneously with data transmission, each time a data packet δ_i arrives at a central site and is integrated with an argument r_i, the expression E(r_1, ..., r_i ⊕ δ_i, ..., r_n) must be recomputed. Obviously, processing the entire expression from the very beginning is too time consuming. It is faster to do it in an incremental way, through processing an increment δ_i together with the previous result of the expression E(r_1, ..., r_i, ..., r_n). Let P(r, s) be an operation of relational algebra. Then incremental processing of P(r ⊕ δ, s) can be computed as P(r, s) ⊕ P′(δ, s), where P′(δ, s) is an incremental/decremental operation (id-operation) of an argument s. The incremental processing of P(r, s ⊕ δ) can be computed as P(r, s) ⊕ P′(r, δ), where P′(r, δ) is an incremental/decremental operation (id-operation) of an argument s. The id-operations P′(δ, s) and P′(r, δ) for the union, join and antijoin operations of the relational algebra are as follows [20]:

∪′(δ, s) = <δ⁻ − s, δ⁺ − s>        (1)
∪′(r, δ) = <δ⁻ − r, δ⁺ − r>        (2)
⋈′(δ, s) = <δ⁻ ⋈_v s, δ⁺ ⋈_v s>    (3)
⋈′(r, δ) = <r ⋈_v δ⁻, r ⋈_v δ⁺>    (4)
▷′(δ, s) = <δ⁻ ▷_v s, δ⁺ ▷_v s>    (5)
▷′(r, δ) = <r ⋉_v δ⁺, r ⋉_v δ⁻>    (6)
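The id-operation idea can be illustrated with a small in-memory example (a sketch we add for illustration only; it uses Python lists of dictionaries rather than relational tables, and the names are ours). When a packet δ⁺ of new tuples arrives for an argument r, the join r ⋈_v s is refreshed by computing only δ⁺ ⋈_v s and adding it to the previous result, instead of recomputing the whole join:

def join(r, s, v):
    """Equijoin of two lists of tuples (dicts) on the attributes listed in v."""
    key = lambda t: tuple(t[a] for a in v)
    index = {}
    for t in s:
        index.setdefault(key(t), []).append(t)
    out = []
    for t in r:
        for u in index.get(key(t), []):
            merged = dict(t)
            merged.update(u)
            out.append(merged)
    return out

# Initial full computation, then incremental maintenance for an increment of r.
r = [{"a": 1, "v": 10}, {"a": 2, "v": 20}]
s = [{"v": 10, "b": "x"}, {"v": 20, "b": "y"}]
result = join(r, s, ["v"])              # previous result of r JOIN_v s

delta_plus = [{"a": 3, "v": 10}]        # a new packet of tuples for argument r
result += join(delta_plus, s, ["v"])    # only the delta is joined with s (cf. eq. (3))
r += delta_plus                         # the argument itself is updated as well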

Fig. 1. A syntax tree of a data integration expression (v (r x s) y t)) z w )


In this work we consider data integration expressions where an operation of projection () is applied only to the nal result of the computations and operation of selection () is performed together with the binary operations of join () and antijoin (). An operation of union is distributive over the operations of join and antijoin. It is true that (r s) x t = (r x t)(s x t) and that (rs) x t = (r x t)(s x t) and that t x (r s) = (t x r) x s. It means that union operations can always be processed and the end of the computation of data integration expression. Hence, without loosing generality we consider only data integration expressions built over the operation of join and antijoin. A sample data integration expression (v (r x s) y t)) z w ) has a syntax tree given in Figure (1). As a simple example consider a data integration expression E(r, s, t) = t v (r z s). Assume that we would like to + nd how an increment s =<, s > of an argument s can be processed in an incremental way, i.e. we would like to recompute an expression E(r, ss , t) using the pevious result E(r, s, t) and the increment s . Application of the equations + (4) and (6) provides a solution E(r, s, t) (t v (s r)). Next, we consider the processing of an increment + t =<, t > of a remote data source t. In this case we need either materialization of an intermediate result of a subexpression (r z s) or transformation of the data integration expression into an equivalent one with either left- or rightdeep syntax tree and with an argument t in the leftmost or rightmost position of the tree. If a materialization mrs = r s is maintained during the processing of data integration expression then from an + equation (4) we get rst =< , t v mrs > and the incremental processing is performed accordingly to E(r, s, t) + (t v mrs ). Maintenance of a materialization mrs decreases the performance of data integration because each time the increments + + r and s are reprocessed a materializaton mrs must be + + integrated with the results r mrs and s mrs . If the + + increments r and s arrive frequently at a data integration site then a materialization mrs must be frequently integrated with the partial results. If a schema of t has common attributes in x with r then + it is possible to transform an expression t v (r s)) + + into t v ((r x t ) s). Then the computations of + r x t ) s can be performed faster than (r s) because


an increment δ_t is small and it can be kept all the time in fast transient memory. We shall call such a transformation elimination of materialization from a data integration expression.

IV. DATA INTEGRATION PLANS

A data integration expression is transformed into a set of data integration plans where each plan represents an integration procedure for the increments of one argument of the original expression. In our approach a data integration plan is a sequence of so called id-operations on the increments or decrements of data containers and other fixed size containers. In order to reduce the size of arguments, static optimization of data integration plans moves the unary operations towards the beginning of a plan. Additionally, the frequently updated materializations are eliminated from the plan, and constant arguments and subexpressions are replaced with precomputed values. A data integration plan is a sequence of assignment statements s_1, ..., s_m where the right hand side of each statement is either an application of a modification to a data container (m_j := m_j ⊕ δ_i) or an application of a left or right id-operation (δ_j := δ_j(δ_i, m_k)). A transformation of a data integration expression into data integration plans is described in [20]. As a simple example consider a data integration expression E(r, s, t) = t ▷_v (r ⋈_z s). The data integration plans p_r and p_s for the increments of arguments r and s are the following.

p_r: δ_rs := δ_r ⋈_z s; m_rs := m_rs ⊕ δ_rs; δ_rst := t ⋉_v δ_rs; result := result ⊕ δ_rst;
p_s: δ_rs := r ⋈_z δ_s; m_rs := m_rs ⊕ δ_rs; δ_rst := t ⋉_v δ_rs; result := result ⊕ δ_rst;

A data integration plan for an argument t is the following.

p_t: δ_rst := δ_t ▷_v m_rs; result := result ⊕ δ_rst;

A data integration plan can also be represented as an extended syntax tree where the materializations are represented as square boxes attached to the edges of the tree, see for example Figure (2) or Figure (4). We say that a data integration plan is a left/right deep data integration plan if it has a left/right deep syntax tree, i.e. a tree such that every non-leaf node has at most one non-leaf descendant node, see Figures (2) or (3). In this work we consider only left/right deep data integration plans.

V. ELIMINATION OF MATERIALIZATIONS


Fig. 2. A case when a materialization m_rs cannot be removed from a data integration plan for δ(de).

much time. Integration of the increments of data with a materialization is needed in left/right deep data integration plans when its incremented argument is not one of two arguments at the lowest level of its syntax tree. A simple solution to this problem would be to transform a left/right deep data integration plan such that an incremented argument is located at the bottom level of the syntax tree. Such transformation is always possible when a data integration plan is built only over the join operations. When a data integration expression is built over join and antijoin operation then in some cases the materializations cannot be removed. For example, it is impossible to eliminate materialization mrs from a data integration plan whose syntax tree is given in Figure (2) because the increments (de) have no common attributes with the arguments s(ab) and r(ac). Elimination of materializations from data integration plans is controlled by the following algorithm. Algorithm (1) (1) Consider a fragment of data integration plan where an increment (z) and materialization m(y) are involved in operation ((z), m(y)), see Figure (3). The operation is performed over a set of attribute x . An objective is to eliminate materialization m(y) from the computations of operation {, , }. A materialization m(y) is a result of an operation (r(v), s(w)) where {, , }. At most one of the arguments of operation (r(v), s(w)) is a materialization. (2) If r(v) is not a materialization and x v is not empty then r(v) can be reduced to r(v) (x ) where (x ) is a projection of (z) on x . (3) If s(w) is not a materialization and x w is not empty then s(w) can be reduced to s(w) (x ). (4) If both r(v) or s(w) are reduced then no more materialization can be eliminated because a leaf level of left/right deep syntax tree of data integration expression has been reached. (5) If either r(v) or s(w) is a materialization then consider a subtree with an operation in the root node as an operation . Next, consider (z) as one of the arguments of operation and either r(v) or s(w) as the second argument of operation . Finally, consider operation whose results are either r(v) or s(w). Next, re-apply the algorithm from the step (1). Correctness of the Algorithm (1) comes from the following observations. A result of operation x ((z), m(y))) does not

Elimination of materializations from data integration plans is motivated by the performance reasons. When a stream of data passing through operation of integration with a materialized view, for example in a statement mrs := mrs rs ; in the example above, is too large then integration takes too

124

The Eighth International Conference on Computing and Information Technology

IC2IT 2012

x (z)

m(y) x

r(v)

s(w)

Fig. 3.

Syntax tree of a fragment of data integration plan

~
ab (abd) m (abc)
2

~
b s(bc)

c m (bc)
1

r(ac)

t(bd)

Fig. 4.

Syntax tree of data integration plan

change if m(y) is reduced by (z) i.e. it is the same as a reusult of x ((z), m(y) (z))) for any {, , }. As m(y) is a result of x (r(v), s(w))) and operation is disributive over {, , } then (z) can be used to reduce either one or both r(v) and s(w). Hence an operation can be effectively recomputed on the reduced arguments and there is no need to store a materialization m(y). As an example consider a syntax tree of data integration plan together with the materialization required for the computation of are given in Figure (4). Application of the steps (1), (2) and (3) of Algorithm (1) provides the reductions of r(ac) to r(ac) a (a) and m1 (bc) to m1 (bc) b (b). It allows for elimination of a materialization m2 (abc). Application of step (5) and later on the repetion of steps (1), (2), and (3) provides the reductions s(bc) b (b) and t(bd) b (b). It allows for elimination of a materialization m1 (bc). Algorithm (1) can be used for generation of all alternative data integration plans for processing of all arguments of data integration expressions. An important problem is when a materialization should be removed from a data integration plan, or speaking in another way, when a plan that uses a materialization should be replaced with another plan that does not use a materialization when processing the increments of the same argument. A decision whether a materialization must be deleted depends on time spent on its maintenance, i.e. time spent on recomputation of a materialization after one of its arguments has changed. A more efcient way to refresh materialization is to integrate the previous state of materialization with the increments of data passing through materialization node in a syntax tree of data integration plan. Then, elimination of materialization simply depends on the amounts of increments of data to be integrated with a materialization. If such amounts of data exceed a given threshold in given xed period of time then an alternative plan that does not use the materialization must be considered. If due to the large processing costs a materialization must be removed from a left/right deep data integration plan then all materializations located above the

materialization considered in a left/right deep syntax tree must be eliminated. This is because in left/right deep syntax trees there is only one path of data processing from the leaf nodes to a root node and the increments of the arguments located at the higher levels of the tree add to the increments coming from the arguments at the lower levels of the tree. It means that at any moment of data integration process there is the topmost materialization in a syntax tree still benecial for the processing and whenever it is possible all materializations above it are not used for data integration. Of course it may happen that the amounts of data passing through a materialization node drop below a threshold and a plan that involves such materialization and all materializations above must be restored. In order to quickly restore the present state of materialization without recomputing it from scratch we record the increments data passing through the materialization nodes. The saved increments are integrated with the latest state of materialization to get its most up-to-date state. Elimination and restoration of materializations is controlled by the following algorithm. Algorithm (2) (1) Consider a left/right deep syntax tree of data integration plan where the materializations m1 , m2 , . . . , mk are located along the edges of the tree starting from the lowest edge in the tree. The amounts of data that have to be integrated with the materializations in a given period of time are recorded at each materialization node. Initially, all materializations are empty and all materialization are maintained in the data integration plans. (2) At the end of every period of time check if the amounts of data to be integrated with m1 , m2 , . . . , mk do not excess a treshold value dmax . If the amounts of data that have to be integrated with a materialization mi exceed dmax in the latest period of time then whenever it is possible the plans that use the materializations mi , mi+1 , ...mk are replaced with the plans that do not use these materializations. Additionally, the increments of (1) (2) (1) (2) (1) (2) data i , i , . . . , i+1 , i+1 , . . . , k , k , . . . passing through the materialization nodes mi , mi+1 , ...mk are recorded by the system. (3) If the amounts of data passing through a materialization node mj , i > j increase above dmax then the materializations mj , mj+1 , . . . mi1 must be removed from data integration plans in the same way as in a step (2). (4) If the amounts of data passing through a materialization node mj , i < j increase above dmax then the plans are not changed. (5) If the amounts of data passing through a materialization node mj , j > i decrease below dmax then the current states of materializations mj , mj1 , . . . , mi must be restored from the recorded sequences of increments (1) (1) (2) (1) (2) j , i (2) , . . . , j1 , j1 , . . . , i , i , . . . and the old states of the materializations. (6) If the amounts of data passing through a materialization node mj , j < i decrease below dmax and the amounts

125

The Eighth International Conference on Computing and Information Technology

IC2IT 2012

of data passing through the materialization nodes mj and above do not change then the plans are not changed. Time complexity of the algorithm is O(n) where n is the total number of operation nodes in left/right deep syntax tree of data integration experession. The algorithm sequentially updates the total amounts of data passing through the materialization nodes at the end of period of time . Whenever a change of execution plan is required the mew plans are taken from a table, which is also sequentially searched. VI. S UMMARY,
OPEN PROBLEMS

Elimination of materializations from data integration plans is required when the maintenance of materializations becomes too time consuming due to the increased intensity of data increments passing through the materialization nodes in the data integration tree. Then, it is worth to replace the current plans with the new ones that do not use the materializations. This work shows how to construct the left/right deep data integration plans that do not use given materializations and when construction of such plans is possible. In particular, we describe an algorithm that generates all data integration plans for a given data integration expression and a given set of arguments. We also show when a materialization cannot be removed from a data integration plans. Next, we propose a procedure that dynamically changes data integration plans in a reply to the increasing costs of maintenance of the selected materializations. Data integration plans considered in this work are limited to left/right deep plans, i.e. the plans whose syntax tree is left/right deep syntax tree. In a general case, some of the distributed database applications do not have left/right deep plans or their bushy plans cannot be transformed into the equivalent left/right deep plans. More research is needed to consider elimination of materializations from bushy data integration plans. Another area, that still need more research is more precise evaluation of the costs and benets coming from elimination of materializations. The algorithm proposed in this work considers only the benets coming from elimination of data integration at materialization maintenance nodes. The costs include the additional operations that must be performed on the increments and other arguments of data integration plans. An interesting problem is what happens when a materialization must be restored due to changing intensity of arriving increments of data. The costs involved are not included into the balance of costs and benets in the current model. It is also interesting how the materializations can be restored to the most up to date state in a more efcient way than by re-applying the stored modications. R EFERENCES
[1] V. Srinivasan and M. J. Carey, Compensation-based on-line query processing, in Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, 1992, pp. 331340. [2] F. Ozcan, S. Nural, P. Koksal, C. Evrendilek, and A. Dogac, Dynamic query optimization in multidatabases, Bulletin of the Technical Committee on Data Engineering, vol. 20, pp. 3845, March 1997.

[3] R. L. Cole and G. Graefe, Optimization of dynamic query evaluation plans, in Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, 1994. [4] N. Kabra and D. J. DeWitt, Efcient mid-query re-optimization of sub-optimal query execution plans, in Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998. [5] J. Chudziak and J. R. Getta, On efcient query evaluation in multidatabase systems, in Second International Workshop on Advances in Database and Information Systems, ADBIS95, 1995, pp. 4654. [6] J. R. Getta and S. Sedighi, Optimizing global query processing plans in heterogeneous and distributed multi database systems, in 10th Intl. Workshop on Database and Expert Systems Applications, DEXA 1999, 1999, pp. 1216. [7] J. R. Getta, Query scrambling in distributed multidatabase systems, in 11th Intl. Workshop on Database and Expert Systems Applications, DEXA 2000, 2000. [8] T. Urhan, M. J. Franklin, and L. Amsaleg, Cost based query scrambling for initial delays, in SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, 1998, pp. 130141. [9] L. Amsaleg, J. Franklin, and A. Tomasic, Dynamic query operator scheduling for wide-area remote access, Journal of Distributed and Parallel Databases, vol. 6, pp. 217246, 1998. [10] T. Urhan and M. J. Franklin, Dynamic pipeline scheduling for improving interactive performance of online queries, in Proceedings of International Conference on Very Large Databases, VLDB 2001, 2001. [11] R. Avnur and J. M. Hellerstein, Eddies: Continuously adaptive query processing, in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 261272. [12] T. Urhan and M. J. Franklin, Xjoin: A reactively-scheduled pipelined join operator, IEEE Data Engineering Bulletin 23(2), pp. 2733, 2000. [13] P. J. Haas and J. M. Hellerstein, Ripple joins for online aggregation, in SIGMOD 1999, Proceedings ACM SIGMOD Intl. Conf. on Management of Data, 1999, pp. 287298. [14] Z. G. Ives, D. Florescu, M. Friedman, A. Y. Levy, and D. S. Weld, An adaptive query execution system for data integration, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999, pp. 299310. [15] M. F. Mokbel, M. Lu, and W. G. Aref, Hash-merge join: A nonblocking join algorithm for producing fast and early join results, 2002. [16] G. Antoshenkov and M. Ziauddin, Query processing and optmization in oracle rdb, VLDB Journal, vol. 5, no. 4, pp. 229237, 2000. [17] V. Raman, A. Deshpande, and J. M. Hellerstein, Using state modules for adaptive query processing, in Proceedings of the 19th International Conference on Data Engineering, 2003, pp. 353. [18] J. R. Getta, On adaptive and online data integration, in Intl. Workshop on Self-Managing Database Systems, 21st Intl. Conf. on Data Engineering, ICDE05, 2005, pp. 12121220. [19] , Optimization of online data integration, in Seventh International Conference on Databases and Information Systems, 2006, pp. 9197. [20] , Static optimization of data integration plans in global information systems, in 13th International Conference on Enterprise Information Systems, June 2011, pp. 141150. [21] , Optimization of task processing schedules in distributed information systems, in International Conference on Informatics Engineering and Information Science, 2011, November 2011. [22] L. Bouganim, F. Fabret, and C. 
Mohan, A dynamic query processing architecture for data integration systems, Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 2, pp. 4248, June 2000. [23] G. Graefe, Dynamic query evaluation plans: Some course corrections? Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 2, pp. 36, June 2000. [24] J. M. Hellerstein, M. J. Franklin, S. Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, and M. A. Shah, Adaptive query processing: Technology in evolution, Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 2, pp. 718, June 2000. [25] Z. G. Ives, A. Y. Levy, D. S. Weld, D. Florescu, and M. Friedman, Adaptive query processing for internet applications, Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 2, pp. 1926, June 2000. [26] A. Gounaris, N. W. Paton, A. A. Fernandes, and R. Sakellariou, Adaptive query processing: A survey, in Proceedings of 19th British National Conference on Databases, 2002, pp. 1125.

126

The Eighth International Conference on Computing and Information Technology

IC2IT 2012

A variable neighbourhood search heuristic for the design of codes


R. Montemanni, M. Salani
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Scuola Universitaria Professionale della Svizzera Italiana, Galleria 2, 6928 Manno, Canton Ticino, Switzerland. Email: {roberto, matteo.salani}@idsia.ch

D.H. Smith, F. Hunt


Division of Mathematics and Statistics, University of Glamorgan, Pontypridd CF37 1DL, Wales, United Kingdom. Email: {dhsmith, fhhunt}@glam.ac.uk

Abstract—Codes play a central role in information theory. A code is a set of words of a given length from a given alphabet of symbols. The words of a code have to fulfil some application-dependent constraints in order to guarantee some form of separation between the words. Typical applications of codes are error correction following transmission or storage of data, or modulation of signals in communications. The target is to have codes with as many codewords as possible, in order to maximise efficiency and provide freedom to the engineers designing the applications. In this paper a variable neighbourhood search framework, used to construct codes in a heuristic fashion, is described. Results on different types of codes of practical interest are presented, showing the potential of the new tool.
Index Terms—Code design, heuristic algorithms, variable neighbourhood search.

I. INTRODUCTION

Code design is a central problem in the field of information theory. A code is a set of words of a given length, composed from a given alphabet, and with some application-dependent characteristics that typically guarantee some form of separation between the words. Codes are usually adopted for error correction of data, or for modulation of signals in communications. Codes have also found use in some biological applications recently [1]. The target is normally to have codes with as many words as possible. In engineering applications this maximises efficiency and provides engineers with the maximum possible freedom when designing communication systems or other specific applications, as described in the second part of this paper. This choice of target makes it natural to formalise the problem as a combinatorial optimisation problem. Depending on the underlying real applications, several types of code can be of practical interest. Many approaches to solve these problems have been proposed in recent decades. So far, most research effort to construct good codes has been based on abstract algebra and group theory [2], [3], while only a marginal exploration of heuristic algorithms has been carried out. In fact codes for error correction do need an algebraic construction to ensure efficient decoding. This is not the case in some other applications, for which heuristic techniques can be used. In this paper a set of heuristic algorithms will be described, and results obtained with them on some code design problems will be summarised. These problems, which have been previously studied in the literature, are used in different practical applications. The paper demonstrates that heuristics are a valuable additional tool that can be successfully used in designing good codes.

II. CODE DESIGN PROBLEMS

A code is a set of words of a given length defined over a given alphabet that fulfils some defined properties. The most typical constraint is on the Hamming distance between each pair of words. The Hamming distance d(x, y) between two words x and y is defined as the number of positions in which they differ. The minimum distance of a code is the minimum Hamming distance between any pair of words of the code. Some side-constraints, which depend on the specific application for which the codes are defined, are also present. The objective of the problem is to find a code that fulfils all the constraints and contains the maximum possible number of words. Code design problems can easily be described in terms of combinatorial optimisation, making it possible to apply heuristic optimisation algorithms to them.

III. A VARIABLE NEIGHBOURHOOD SEARCH FRAMEWORK

A Variable Neighbourhood Search (VNS) algorithm [4] that combines a set of local search routines is presented. First, the local searches embedded in the algorithm are briefly described. The interested reader is referred to [5] for more details.

A. Seed Building
A simple heuristic method to build codes examines all possible words in a given order, and incrementally accepts words that are feasible with respect to the already accepted ones. The Seed Building method is built on these orderings, which are combined with the concept of seed words [6]. These seed words are an initial set of feasible words to which words are added in a given (problem dependent) order if they satisfy the necessary criteria. The set of seeds is initially empty, and one feasible random seed is added at a time. If the new seed set leads to good results when a code is built from it, the seed is kept and a new random seed is designated for testing. This increases the size of the seed set. The same rationale, which is based on some simple statistics, is used to decide whether to keep subsequent


seeds or not. If after a given number of iterations the quality of the solutions provided by a set of seeds is judged to be not good enough, the most recent seed is eliminated from the set, which results therefore in a reduction in the size of the seed set. In this way the set of seeds is expanded or contracted depending on the quality of the solutions built using the set itself. What happens in practice is that the size of the seed set oscillates through a range of small values. The algorithm is stopped after a given time has elapsed. B. Clique Search The idea at the basis of this local search method is that it is possible to complete a partial code in the best possible way by solving a maximum clique problem (see [7], [8]). More precisely, given a code, a random subset of the words is removed, leaving a partial code. It is possible to identify all the feasible words compatible with those already in the code, and build a graph from these words, where words are represented by vertices. Two vertices are connected if and only if the pair of words respect all of the constraints considered. It is then possible to run a maximum clique algorithm on the graph in order to complete the partial code in the best possible way. Heuristic or exact methods can be used to solve the maximum clique problem. In the implementation described here an exact or truncated version of the algorithm presented in [9] is used. The search is run repeatedly, with different random subsets. The algorithm is stopped after a given time has elapsed. C. Hybrid Search This Hybrid Search method merges together the main concepts at the basis of the two local search algorithms described in Sections III-A and III-B. There is a (small) set SeedSet of words, that play the role of the seeds of algorithm Seed Building. A set of words which are compatible with the elements of SeedSet (as in Clique Search), which are also compatible with each other in a weaker form, are identied and saved in a set V . More precisely, there is a parameter which is used to model the concept of weak compatibility: the words in V have to be compatible with each other according to a relaxed distance d = d . Weak compatibility is used to keep the set V at reasonable sizes, even in case of a very small set of seeds. A compatibility graph, for the creation of which the original distance d is used, is built on the vertex set V , as described in Section III-B. A maximum clique problem is solved on this graph. The mechanism described in Section III-A for the management of seed words, is adopted unchanged here. In this way the set SeedSet is expanded and contracted during the computation. The algorithm is stopped after a given time has elapsed. D. Iterated Greedy Search This method is different from the local search approaches previously described in the way solutions are handled. The algorithms described in the previous sections maintain a set of feasible words (i.e. respecting all the constraints) and try to enlarge this set. The Iterated Greedy Search method, which

is inspired by the method discussed in [10], works on an infeasible set of words (i.e. not all of the words are compatible with each other, according to the constraints). The method evolves by modifying words of a current solution S with the target of reducing a measure Inf(S) of the constraint violations. If no violation remains, then a feasible solution has been retrieved. In more detail, the local search method works as follows. An (infeasible) solution S is created by replacing a given percentage of the words of a given feasible solution by randomly generated words. A random word is added to solution S. The following operations are then repeated until a feasible solution has been retrieved (i.e. Inf(S) = 0), or a given number of iterations has been spent without improvement. A word cw of solution S is selected at random, and the change of one of its coordinates that guarantees the maximum decrease in the infeasibility measure is selected (ties among possible modifications of cw are broken randomly). The word cw is then modified accordingly. When a feasible solution is retrieved, it is saved and the procedure is repeated, starting from the new solution; otherwise the last feasible solution is restored and the procedure is applied to this solution. The algorithm is stopped after a given time has elapsed.

E. A Variable Neighbourhood Search approach

Variable Neighbourhood Search (VNS) methods have been demonstrated to perform well and are robust (see [4]). Such algorithms work by applying different local search algorithms one after the other, aiming at differentiating the characteristics of the search-spaces visited (i.e. changing the neighbourhood). The rationale behind the idea is that combining together different local search methods, that use different optimisation logics, can lead to an algorithm capable of escaping from the local optima identified by each local search algorithm, with the help of the other local search methods. In our context, some of the local search methods previously described are applied in turn, starting each time from the best solution retrieved since the beginning (or from an empty solution, in the case of Seed Building). The algorithm is stopped after a given time has elapsed.

IV. CONSTANT WEIGHT BINARY CODES

A constant weight binary code is a set of binary vectors of length n, weight w and minimum Hamming distance d. The weight of a binary vector (or word) is the number of 1s in the vector. The minimum distance of a code is the minimum Hamming distance between any pair of words. The maximum possible number of words in a constant weight code is referred to as A(n, d, w). Apart from their important role in the theory of error-correcting codes, constant weight codes have also found application in fields as diverse as the design of demultiplexers for nano-scale memories [11] and the construction of frequency hopping lists for use in GSM networks [12].
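As an illustration of these definitions, the following minimal C sketch (our own, not taken from the referenced implementations; the function names are illustrative) checks the two properties that a candidate word must satisfy before it can join a constant weight code: it must have weight w, and its Hamming distance to every word already in the code must be at least d.

#include <stdio.h>

/* Hamming distance between two binary words of length n (entries 0/1). */
static int hamming(const int *a, const int *b, int n) {
    int i, dist = 0;
    for (i = 0; i < n; i++)
        if (a[i] != b[i]) dist++;
    return dist;
}

/* Weight (number of 1s) of a binary word of length n. */
static int weight(const int *a, int n) {
    int i, w = 0;
    for (i = 0; i < n; i++) w += a[i];
    return w;
}

/* Returns 1 if 'cand' can be added to 'code' (m words of length n)
   without violating the constant weight w and minimum distance d. */
static int feasible(const int *cand, int code[][8], int m, int n, int w, int d) {
    int i;
    if (weight(cand, n) != w) return 0;
    for (i = 0; i < m; i++)
        if (hamming(cand, code[i], n) < d) return 0;
    return 1;
}

int main(void) {
    /* Toy instance of our own: n = 8, w = 4, d = 4. */
    int code[2][8] = { {1,1,1,1,0,0,0,0}, {1,1,0,0,1,1,0,0} };
    int cand[8]    =   {0,0,1,1,1,1,0,0};
    printf("feasible = %d\n", feasible(cand, code, 2, 8, 4, 4));
    return 0;
}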

TABLE I: IMPROVED CONSTANT WEIGHT BINARY CODES. Each entry gives the problem, the previously best-known lower bound and the new lower bound found by VNS, as Old LB -> New LB; superscript letters refer to the footnotes below the table.

A(41,6,5) 755->779a; A(42,6,5) 817->841b; A(43,6,5) 874->910b; A(44,6,5) 941->975b;
A(45,6,5) 1009->1030b; A(46,6,5) 1097->1114a; A(47,6,5) 1172->1181b; A(48,6,5) 1254->1269b;
A(49,6,5) 1343->1347c; A(50,6,5) 1429->1459a; A(51,6,5) 1517->1543b; A(52,6,5) 1617->1654a;
A(53,6,5) 1719->1758c; A(54,6,5) 1822->1840f; A(55,6,5) 1936->1948b; A(32,6,6) 1353->1369a;
A(33,6,6) 1528->1560a; A(34,6,6) 1740->1771b; A(35,6,6) 1973->1998b; A(36,6,6) 2240->2264b;
A(37,6,6) 2539->2560f; A(38,6,6) 2836->2860b; A(39,6,6) 3167->3208a; A(40,6,6) 3545->3575a;
A(41,6,6) 3964->3983a; A(42,6,6) 4397->4419b; A(43,6,6) 4860->4890b; A(44,6,6) 5378->5414a;
A(45,6,6) 5933->5959b; A(46,6,6) 6521->6552a; A(47,6,6) 7160->7194a; A(48,6,6) 7845->7869a;
A(49,6,6) 8568->8605b; A(50,6,6) 9348->9380b; A(51,6,6) 10175->10210b; A(33,8,5) 44->45d;
A(38,8,6) 231->236b; A(39,8,6) 252->254b; A(40,8,6) 275->281b; A(41,8,6) 294->297f;
A(43,8,6) 343->347f; A(44,8,6) 355->381f; A(45,8,6) 381->403f; A(46,8,6) 411->432f;
A(47,8,6) 440->463c; A(48,8,6) 477->494b; A(49,8,6) 501->527f; A(50,8,6) 542->567f;
A(51,8,6) 576->606f; A(52,8,6) 609->640c; A(53,8,6) 650->687c; A(54,8,6) 682->726b;
A(55,8,6) 729->768f; A(56,8,6) 766->815f; A(57,8,6) 830->866f; A(58,8,6) 872->912f;
A(59,8,6) 935->965f; A(60,8,6) 982->1019f; A(61,8,6) 1028->1077f; A(62,8,6) 1079->1130f;
A(63,8,6) 1143->1195f; A(30,8,7) 327->340f; A(31,8,7) 363->375f; A(32,8,7) 403->418f;
A(33,8,7) 444->466a; A(34,8,7) 498->516a; A(35,8,7) 555->570a; A(36,8,7) 622->637b;
A(37,8,7) 696->718a; A(38,8,7) 785->795b; A(39,8,7) 869->893a; A(40,8,7) 977->999a;
A(41,8,7) 1095->1110a; A(42,8,7) 1206->1227b; A(43,8,7) 1347->1365a; A(44,8,7) 1478->1503a;
A(45,8,7) 1639->1653f; A(46,8,7) 1795->1813f; A(47,8,7) 1987->2001f; A(48,8,7) 2173->2197f;
A(49,8,7) 2376->2399b; A(50,8,7) 2603->2615f; A(51,8,7) 2839->2866f; A(52,8,7) 3101->3118f;
A(53,8,7) 3376->3384f; A(54,8,7) 3651->3667b; A(55,8,7) 3941->3989b; A(56,8,7) 4270->4318b;
A(59,8,7) 5384->5386f; A(45,10,6) 49->50a; A(48,10,6) 56->57a; A(49,10,6) 56->59f;
A(50,10,6) 56->62f; A(51,10,6) 60->64d; A(52,10,6) 60->67a; A(53,10,6) 63->70d;
A(54,10,6) 65->73a; A(55,10,6) 68->73h; A(56,10,6) 70->79d; A(57,10,6) 70->83d;
A(58,10,6) 72->85b; A(59,10,6) 77->87c; A(60,10,6) 79->91a; A(61,10,6) 83->94a;
A(62,10,6) 84->98c; A(29,10,7) 37->39d; A(36,10,7) 75->78c; A(42,10,7) 133->137c;
A(56,10,7) 351->358b; A(57,10,7) 366->374b; A(58,10,7) 394->399b; A(59,10,7) 414->423b;
A(60,10,7) 431->449b; A(61,10,7) 458->474b; A(62,10,7) 486->497b; A(63,10,7) 514->526b;
A(30,10,8) 92->93d; A(33,10,8) 134->140d; A(34,10,8) 156->162a; A(35,10,8) 176->182d;
A(36,10,8) 198->205d; A(37,10,8) 223->230d; A(38,10,8) 249->259a; A(39,10,8) 285->291d;
A(40,10,8) 318->324d; A(41,10,8) 353->362d; A(42,10,8) 390->398a; A(43,10,8) 432->445f;
A(44,10,8) 484->487a; A(45,10,8) 532->544c; A(46,10,8) 590->595a; A(47,10,8) 642->656e;
A(48,10,8) 711->720e; A(49,10,8) 776->785e; A(50,10,8) 852->858e; A(51,10,8) 929->934e;
A(52,10,8) 1007->1018e; A(55,10,8) 1289->1296e; A(56,10,8) 1405->1408a; A(32,12,7) 9->10;
A(33,12,7) 10->11; A(36,12,7) 15->16; A(37,12,7) 16->17; A(39,12,7) 19->21;
A(40,12,7) 20->22b; A(37,12,8) 40->42d; A(38,12,8) 40->45a; A(43,14,8) 10->12;
A(44,14,8) 12->13; A(45,14,8) 12->15; A(46,14,8) 13->17; A(47,14,8) 15->18;
A(48,14,8) 18->19.

a: Later improved in [13] by a specific group of automorphisms or a combinatorial construction.
b: Later improved in [13] by heuristic polishing of a group code or a combinatorial construction.
c: Later improved in [13] by shortening a code of length n+1 and weight w.
d: Later improved in [13] by a cyclic group.
e: Later improved in [13] by shortening a code of length n+1 and weight w or w+1 and a heuristic improvement.
f: Later improved in [13] by an unspecified method.

A VNS algorithm combining Seed Building and Clique Search (see Section III) was proposed in [8], and it was shown to be able to improve best-known results from the literature for many instances with parameter settings appropriate to the frequency hopping applications (29 <= n <= 63 and 5 <= w <= 8 with d = 2w-2, d = 2w-4 or d = 2w-6), for which mathematical constructions were not very well developed previously (the interested reader is referred to [8] for a detailed description of parameter tuning and experimental settings). The instances improved with respect to the state-of-the-art are summarised in Table I, where the new lower bounds provided by the VNS method (New LB) are compared with the previously best-known results (Old LB). Many results were improved, especially for large values of n. These were instances for which the previous methods used were not particularly effective. Notice that most of the instances reported in the table were later further improved by other methods, many of which again make use of heuristic criteria. See [13] for full details.

V. QUATERNARY DNA CODES

Quaternary DNA codes are sets of words of fixed length n over the alphabet {A, C, G, T}. The words of a code have to satisfy the following combinatorial constraints. For each pair of words, the Hamming distance has to be at least d (constraint HD); a fixed number (here taken as n/2) of letters of each word have to be either G or C (constraint GC); the Hamming distance between each word and the Watson-Crick complement (or reverse-complement) of each word has to be at least d (constraint RC), where the Watson-Crick complement of a word x1 x2 ... xn is defined as comp(xn) comp(xn-1) ... comp(x1), with comp(A) = T, comp(T) = A, comp(C) = G, comp(G) = C. If the number of letters which are G or C in each word is n/2, then A4^GC(n, d, n/2) is used to denote the maximum number of words in a code satisfying constraints HD and GC. A4^GC,RC(n, d, n/2) is used to denote the maximum number of words in a code satisfying constraints HD, GC and RC. Quaternary DNA codes have applications to information storage and retrieval in synthetic DNA strands. They are used in DNA computing, as probes in DNA microarray

technologies and as molecular bar codes for chemical libraries [5]. Constraints HD and RC are used to make unwanted hybridisations less likely, while constraint GC is imposed to ensure uniform melting temperatures, where DNA melting is the process by which double-stranded DNA unwinds and separates into single strands through the breaking of hydrogen bonding between the bases. Such constraints have been used, for example, in [2], [14], where more detailed technical motivations for the constraints can be found. Lower bounds for Quaternary DNA codes obtained using different tools such as mathematical constructions, stochastic searches, template-map strategies, genetic algorithms and lexicographic searches have been proposed (see [7], [14], [2], [15], [10], [5], [16], [17] and [18]).

A VNS method embedding all the local search routines described in Section III was implemented in [5], [16]. Experiments were conducted for A4^GC(n, d, w) and A4^GC,RC(n, d, w) with 4 <= n <= 20, 3 <= d <= n and 21 <= n <= 30, 13 <= d <= n. In Table II the new lower bounds (New LB) retrieved by VNS during the experiments that improved the previous state-of-the-art results (Old LB) are summarised. It is interesting to observe how substantial the improvements for this family of codes sometimes are (see, for example, A4^GC(19, 10, 9) and A4^GC(19, 11, 9)). The reader is referred to [16] for a detailed description of parameter tuning and experimental settings.

VI. PERMUTATION CODES

A permutation code is a set of permutations in the symmetric group Sn of all permutations on n elements. The words are the permutations and the code length is n. The ability of a permutation code to correct errors is related to the minimum Hamming distance of the code. The minimum Hamming distance d is then the minimum distance taken over all pairs of distinct permutations. The maximum number of words in a code of length n with minimum distance d is denoted by M(n, d). Permutation codes (sometimes called permutation arrays) have been proposed in [19] for use with a specific modulation scheme for powerline communications. An account of the rationale for the choice of permutation codes can be found in [3]. Permutations are used to ensure that power output remains as constant as possible. As well as white Gaussian noise the codes must combat permanent narrow band noise from electrical equipment or magnetic fields, and impulsive noise. A central practical question in the theory of permutation codes is the determination of M(n, d), or of good lower bounds for M(n, d). The most complete contribution to this question is in [3]. More recently, different methods, both based on permutation groups and heuristic algorithms, have been presented in [20]. In this paper a VNS approach involving Clique Search only (basically an Iterated Clique Search method) was introduced among other approaches. In some cases the method was run on cycles of words of length n or n-1 instead of words. This reduces the complexity of the problem, making it tractable by the VNS approach. Experimental results

TABLE II: IMPROVED QUATERNARY DNA CODES. Each entry gives the problem and the lower bounds as Old LB -> New LB; the superscript refers to the footnote below the table.

A4^GC(7,3,3) 280->288; A4^GC(7,4,3) 72->78g; A4^GC(8,5,4) 56->63; A4^GC(8,6,4) 24->28;
A4^GC(9,6,4) 40->48; A4^GC(9,7,4) 16->18; A4^GC(10,4,5) 1710->2016g; A4^GC(10,7,5) 32->34;
A4^GC(11,7,5) 72->75; A4^GC(11,9,5) 10->11; A4^GC(12,7,6) 179->183; A4^GC(12,8,6) 68->118;
A4^GC(12,9,6) 23->24; A4^GC(13,9,6) 44->46; A4^GC(14,11,7) 16->17; A4^GC(15,9,7) 225->227;
A4^GC(15,11,7) 30->34; A4^GC(15,12,7) 13->15; A4^GC(17,13,8) 22->24; A4^GC(18,11,9) 216->282;
A4^GC(18,13,9) 38->46; A4^GC(18,14,9) 18->20; A4^GC(19,10,9) 1326->2047; A4^GC(19,11,9) 431->615;
A4^GC(19,12,9) 163->213; A4^GC(19,13,9) 71->83; A4^GC(19,14,9) 33->38; A4^GC(19,15,9) 15->17;
A4^GC(20,13,10) 130->167; A4^GC(20,14,10) 58->69; A4^GC(20,15,10) 31->33; A4^GC(20,16,10) 13->16;
A4^GC,RC(9,6,4) 20->21; A4^GC,RC(10,5,5) 175->176; A4^GC,RC(10,7,5) 16->17; A4^GC,RC(11,7,5) 36->37;
A4^GC,RC(11,8,5) 13->14; A4^GC,RC(12,5,6) 1369->1381; A4^GC,RC(12,7,6) 83->87; A4^GC,RC(12,8,6) 28->29;
A4^GC,RC(12,9,6) 11->12; A4^GC,RC(13,5,6) 3954->3974; A4^GC,RC(13,7,6) 205->206; A4^GC,RC(13,8,6) 61->62;
A4^GC,RC(13,9,6) 22->23; A4^GC,RC(13,10,6) 9->10; A4^GC,RC(14,9,7) 46->49; A4^GC,RC(14,10,7) 16->20;
A4^GC,RC(14,11,7) 7->8; A4^GC,RC(15,6,7) 6430->6634; A4^GC,RC(15,8,7) 343->347; A4^GC,RC(15,9,7) 102->109;
A4^GC,RC(15,10,7) 35->37; A4^GC,RC(16,9,8) 230->243; A4^GC,RC(16,10,8) 74->83; A4^GC,RC(17,9,8) 549->579;
A4^GC,RC(17,10,8) 164->175; A4^GC,RC(17,11,8) 56->62; A4^GC,RC(17,13,8) 11->12; A4^GC,RC(18,9,9) 1403->1459;
A4^GC,RC(18,10,9) 387->407; A4^GC,RC(18,11,9) 104->133; A4^GC,RC(18,12,9) 43->49; A4^GC,RC(18,13,9) 19->21;
A4^GC,RC(18,14,9) 9->10; A4^GC,RC(19,9,9) 3519->3678; A4^GC,RC(19,10,9) 909->960; A4^GC,RC(19,11,9) 215->285;
A4^GC,RC(19,12,9) 80->99; A4^GC,RC(19,13,9) 35->39; A4^GC,RC(19,14,9) 16->18; A4^GC,RC(19,15,9) 7->8;
A4^GC,RC(20,13,10) 64->77; A4^GC,RC(20,14,10) 29->33; A4^GC,RC(20,15,10) 14->15; A4^GC,RC(20,16,10) 6->7.

g: Later improved in [16] by a heuristic approach based on an Evolutionary Algorithm.

(see [20] for details on parameter tuning and experimental settings) were discussed for 6 <= n <= 18 and 4 <= d <= 18, plus M(19, 17) and M(20, 19). The new best-known results retrieved by VNS (New LB) are summarised in Table III, where they are compared with the previous state-of-the-art results (Old LB). Superscripts reflect the domain on which the VNS method was run. Beside providing the first non-trivial bound for some of the instances, the algorithm was also able to provide substantial improvements over the previous best-known results (see, for example, M(15, 13)).

VII. PERMUTATION CODES WITH SPECIFIED PACKING RADIUS

Using the notation introduced in the previous section, a ball of radius e surrounding a word w in Sn is composed of all the

TABLE III: IMPROVED PERMUTATION CODES (Problem, Old LB, New LB). Improved instances: M(13,8), M(13,9), M(13,10), M(13,11), M(14,13), M(15,11), M(15,13), M(15,14), M(16,13), M(16,14), M(18,17), M(19,17), M(20,19). Old LB / New LB values as printed: 27132h, 56, 3588, 4810, 1266, 906, 269, 195i, 54, 70, 52, 343h, 6076, 78, 84, 243.
h: VNS on words. i: VNS on cycles of words of length n instead of words.

TABLE IV: IMPROVED PERMUTATION CODES WITH A SPECIFIED PACKING RADIUS (Old LB -> New LB; a dash indicates no previous bound was listed).
P[5,2] 5->10; P[6,2] 18->30; P[6,3] - ->6; P[7,2] 77->126; P[7,3] 7->22; P[8,4] - ->8; P[9,4] 9->25; P[10,4] 49->110j; P[10,5] - ->10; P[11,5] 11->33; P[12,5] 60->144j; P[12,6] - ->12; P[13,4] 4810->15120k; P[13,5] 195->612k; P[13,6] 13->40; P[14,4] 6552->110682k; P[14,5] 2184->3483; P[14,6] 52->169k; P[15,6] 243->769.
j: VNS on cycles of words of length n instead of words. k: VNS on cycles of words of length n-1 instead of words.

permutations of Sn with Hamming distance from w at most e. Given a permutation code C, the packing radius of C is defined as the maximum value of e such that the balls of radius e centred at words of C do not overlap. The maximum number of permutations of length n with packing radius at least e is denoted by P[n, e]. From a practical point of view, a permutation code (see Section VI) with d = 2e+1 or d = 2e+2 can correct up to e errors. On the other hand, it is known that in an (n, 2e) permutation code the balls of radius e surrounding the codewords may all be pairwise disjoint, but usually some overlap. Thus an (n, 2e) permutation code is generally unable to correct e errors using nearest neighbour decoding. On the other hand, a permutation code with packing radius e (denoted [n, e]) can always correct e errors. Thus, the packing radius more accurately specifies the requirement for an e-error-correcting permutation code than does the minimum Hamming distance [21]. A basic VNS algorithm involving Clique Search only (Iterated Clique Search) was presented, among other methods, in [21]. The method was tested on instances with 4 <= n <= 15 and 2 <= e <= 6 (all parameter tunings and experimental settings are described in the paper). The new best-known lower bounds retrieved by the VNS method (New LB) are summarised in Table IV, comparing them with the previous state-of-the-art bounds (Old LB). Notice that also in this case superscripts reflect the domain on which the VNS method was run. As in Section VI, for complexity reasons, it was sometimes convenient to run the method on cycles of words of length n or n-1 instead of words. From the results of Table IV it can be observed that the improvements over the previous state-of-the-art are sometimes remarkable (see, for example, P[14, 4]).

VIII. CONCLUSIONS

A heuristic framework based on Variable Neighbourhood Search for code design has been described. Experimental results carried out on four different code families, used in different applications, have been presented. Parameter tuning has been carried out for all algorithms used for these applications, and is described in the referenced papers. However, it has been observed that the exact choice of parameters is not particularly critical. From the experiments it is clear that heuristics are a valuable additional tool in the design of new improved codes.


ACKNOWLEDGMENT

R. Montemanni and M. Salani acknowledge the support of the Swiss Hasler Foundation through grant 11158: Heuristics for the design of codes.

REFERENCES
[1] M. K. Gupta, "The quest for error correction in biology," IEEE Engineering in Medicine and Biology Magazine, vol. 25, no. 1, pp. 46-53, 2006.
[2] O. D. King, "Bounds for DNA codes with constant GC-content," Electronic Journal of Combinatorics, vol. 10, #R33, 2003.
[3] W. Chu, C. J. Colbourn and P. Dukes, "Constructions for permutation codes in powerline communications," Designs, Codes and Cryptography, vol. 32, pp. 51-64, 2004.
[4] P. Hansen and N. Mladenovic, "Variable neighbourhood search: principles and applications," European Journal of Operational Research, vol. 130, pp. 449-467, 2001.
[5] R. Montemanni and D. H. Smith, "Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm," Journal of Mathematical Modelling and Algorithms, vol. 7, pp. 311-326, 2008.
[6] A. E. Brouwer, J. B. Shearer, N. J. A. Sloane, and W. D. Smith, "A new table of constant weight codes," IEEE Transactions on Information Theory, vol. 36, pp. 1334-1380, 1990.
[7] Y. M. Chee and S. Ling, "Improved lower bounds for constant GC-content DNA codes," IEEE Transactions on Information Theory, vol. 54, no. 1, pp. 391-394, 2008.
[8] R. Montemanni and D. H. Smith, "Heuristic algorithms for constructing binary constant weight codes," IEEE Transactions on Information Theory, vol. 55, no. 10, pp. 4651-4656, 2009.
[9] R. Carraghan and P. Pardalos, "An exact algorithm for the maximum clique problem," Operations Research Letters, vol. 9, pp. 375-382, 1990.
[10] D. C. Tulpan, H. H. Hoos, and A. E. Condon, "Stochastic local search algorithms for DNA word design," Lecture Notes in Computer Science, Springer, Berlin, vol. 2568, pp. 229-241, 2002.
[11] P. J. Kuekes, W. Robinett, R. M. Roth, G. Seroussi, G. S. Snider, and R. S. Williams, "Resistor-logic demultiplexers for nanoelectronics based on constant-weight codes," Nanotechnology, vol. 17, pp. 1052-1061, 2006.
[12] J. N. J. Moon, L. A. Hughes, and D. H. Smith, "Assignment of frequency lists in frequency hopping networks," IEEE Transactions on Vehicular Technology, vol. 54, no. 3, pp. 1147-1159, 2005.
[13] A. E. Brouwer, Bounds for binary constant weight codes. http://www.win.tue.nl/aeb/codes/Andw.html.
[14] P. Gaborit and O. D. King, "Linear construction for DNA codes," Theoretical Computer Science, vol. 334, pp. 99-113, 2005.
[15] D. C. Tulpan and H. H. Hoos, "Hybrid randomised neighbourhoods improve stochastic local search for DNA code design," Lecture Notes in Computer Science, Springer, Berlin, vol. 2671, pp. 418-433, 2003.

[16] R. Montemanni, D. H. Smith, and N. Koul, "Three metaheuristics for the construction of constant GC-content DNA codes," in: S. Voss and M. Caserta (eds.), Metaheuristics: Intelligent Decision Making (Operations Research / Computer Science Interface Series), Springer-Verlag, New York, 2011.
[17] D. H. Smith, N. Aboluion, R. Montemanni, and S. Perkins, "Linear and nonlinear constructions of DNA codes with Hamming distance d and constant GC-content," Discrete Mathematics, vol. 311, no. 14, pp. 1207-1219, 2011.
[18] N. Aboluion, D. H. Smith and S. Perkins, "Linear and nonlinear constructions of DNA codes with Hamming distance d, constant GC-content and a reverse-complement constraint," Discrete Mathematics, vol. 312, no. 5, pp. 1062-1075, 2012.
[19] N. Pavlidou, A. J. Han Vinck, J. Yazdani and B. Honary, "Power line communications: state of the art and future trends," IEEE Communications Magazine, vol. 41, no. 4, pp. 34-40, 2003.
[20] D. H. Smith and R. Montemanni, "A new table of permutation codes," Designs, Codes and Cryptography, Online First, 2011, DOI 10.1007/s10623-011-9551-8.
[21] D. H. Smith and R. Montemanni, "Permutation codes with specified packing radius," Designs, Codes and Cryptography, Online First, 2012, DOI 10.1007/s10623-012-9623-4.

Spatial Join with R-Tree on Graphics Processing Units


Tongjai Yampaka
Department of Computer Engineering, Chulalongkorn University, Bangkok, Thailand
Tongjai.Y@Student.chula.ac.th

Prabhas Chongstitvatana
Department of Computer Engineering, Chulalongkorn University, Bangkok, Thailand
prabhas@chula.ac.th

Abstract: Spatial operations such as spatial join combine two objects on spatial predicates. Spatial join is different from relational join because objects have multiple dimensions, and spatial join consumes a large amount of execution time. Recently, much research has tried to find methods to improve the execution time. Parallel spatial join is one method to improve the execution time, since comparisons between objects can be done in parallel. Spatial datasets are large, and the R-Tree data structure can improve the performance of spatial join. In this paper, a parallel spatial join on a Graphics Processing Unit (GPU) is introduced. The capacity of the GPU, which has many processors to accelerate the computation, is exploited. An experiment is carried out to compare the spatial join between a sequential implementation in C on a CPU and a parallel implementation in CUDA C on a GPU. The result shows that the spatial join on the GPU is faster than on a conventional processor.

Keyword: Spatial Join, Spatial Join with R-tree, Graphic processing unit

I. INTRODUCTION

The evolution of Graphic Processing Unit is driven by the demand for real time, high-definition and 3-D graphics. The requirement for an efficient and fast computation has been met by parallel computation [1]. In addition, GPU architecture that supports parallel computation is programmable to solve other problems. This new trend is called General Purpose computing on Graphic processors (GPGPU). Developers can use the capacity of GPU to solve other problem beside graphics and can improve the execution time by parallel computation. In a spatial database, storing and managing complex and large datasets such as Graphic Information system (GIS) and Computer-aided design (CAD) are time consuming. A spatial database characteristic is different from a relational database because of data type. Spatial data types are point, line and polygon. The type of data depends on the characteristic of objects, for example a road is represented by a line or a city is represented by a polygon. An object shape is created by x, y and z coordinates. Therefore, spatial operations in a spatial database are not the same as operations in a relational database. There are specific techniques for spatial operations. Spatial join combines between two objects on spatial predicates, for example, find intersection between two objects. It is an expensive operation because spatial

datasets can be complex and very large. Their processing cost is very high. To solve this problem R-Tree is used to improve the performance for accessing data in spatial join. Spatial objects are indexed by spatial indexing [2] [3]. The objects are represented by minimum bounding rectangles which cover them. An internal node points to children nodes that are covered by their parents. A leaf node points to real objects. The join with R-Tree begins with a minimum bounding rectangle. The test for an overlap is performed from a root node to a leaf node. It is possible that there are overlaps in sub-trees too. The previous work [4] introduces a technique for spatial join that can be divided into two steps. Filter Step: This step computes an approximation of each spatial object, its minimum bounding rectangle. This step produces rectangles that cover all objects. Refinement Step: In this step, spatial join predicates are performed over each object. Recently, spatial join techniques have been proposed in many works. In a survey [5], many techniques to improve spatial join are described. One technique shows a parallel spatial join that improves the execution time for this operation. This paper presents a spatial join with R-Tree on Graphic processing units. The parallel step is executed for testing an overlap. The paper is organized as follow. Section 2 explains the background and reviews related works. Section 3 describes the spatial join with R-Tree on Graphic processing units. Section 4 explains the experiment. The results are presented in Section 5. Section 6 concludes the paper.

II. BACKGROUND AND RELATED WORK

A. Spatial join with R-Tree

Spatial join combines two objects with spatial predicates. Objects have multiple dimensions, so it is important to retrieve data efficiently. In a survey [5], techniques of spatial join are presented. Indexing data, such as with an R-Tree, is one method which improves I/O time. In [6], R-Tree is used for spatial join. Before executing a spatial join predicate in the leaf level, an overlap between two objects from parent nodes is tested. When parent nodes overlap, the search continues into sub-trees that are covered by their parents. The sub-trees which do not overlap at the parent level are ignored. The reason is that the overlapped parent nodes are probably

overlapped with leaf nodes too. In the next step, the overlap function test is called with sub-trees recursively. This algorithm is shown in Figure 1.

SpatialJoin(R,S):
  For (all ptrS in S) Do
    For (all ptrR in R with ptrR.rect overlapping ptrS.rect) Do
      If (R is a leaf node) Then
        Output (ptrR, ptrS)
      Else
        Read (ptrR.child); Read (ptrS.child)
        SpatialJoin(ptrR.child, ptrS.child)
      End
    End
  End
End SpatialJoin;
Figure 1 Spatial join with R-Tree
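To make the recursion of Figure 1 concrete, the following minimal C sketch shows one way the same traversal could be written. The node layout and the function names are our own illustrative assumptions and are not taken from [6] or from the implementation evaluated later in this paper; the sketch assumes both trees have the same height, so leaves are always paired with leaves.

#include <stdio.h>

typedef struct { int min_x, max_x, min_y, max_y; } Rect;

typedef struct Node {
    Rect rect;               /* minimum bounding rectangle of this entry       */
    int is_leaf;             /* 1 if the entry points to a real spatial object */
    int id;                  /* object identifier (valid only for leaves)      */
    struct Node *child;      /* children entries (internal nodes only)         */
    int nchild;              /* number of children                             */
} Node;

/* The intersection predicate used by the filter step. */
static int overlap(const Rect *a, const Rect *b) {
    return b->min_x < a->max_x && b->max_x > a->min_x &&
           b->min_y < a->max_y && b->max_y > a->min_y;
}

/* Depth-first R-Tree join in the spirit of Figure 1: pairs of overlapping
   entries are either reported (leaves) or expanded (internal nodes). */
void spatial_join(const Node *r, int nr, const Node *s, int ns) {
    int i, j;
    for (j = 0; j < ns; j++)
        for (i = 0; i < nr; i++)
            if (overlap(&r[i].rect, &s[j].rect)) {
                if (r[i].is_leaf && s[j].is_leaf)
                    printf("candidate pair (%d, %d)\n", r[i].id, s[j].id);
                else
                    spatial_join(r[i].child, r[i].nchild, s[j].child, s[j].nchild);
            }
}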

The algorithm is designed for parallel operation on a CPU. In this paper we use the same idea for the algorithm but it is implemented on a GPU. In other research [8], R-Tree is used in parallel search. The algorithm distributes objects to separate sites and creates index data objects from leaves to parents. Every parent has entries to all sites. A search query such as windows query can perform search in parallel. C. Spatial query on GPU For a parallel operation in GPU, the work in [9] implements a spatial indexing algorithm to perform a parallel search. A linear-space search algorithm is presented that is suitable for the CUDA [1] programming model. Before the search query begins, a preparation of data array is required for the R-Tree. This is done on CPU. Then the data array is loaded into device memory. The search query is launched on GPU threads. The data structure has two data arrays represented in bits. The arithmetic at bit level is exploited. The first array stores MBR co-ordinate referred to the bottom-left and top-right co-ordinates of the i MBR in the index. The second array is an array of R-Tree nodes. R-Tree nodes store the set {MBRi, childNode|t|}. ChildNode|t| is an index into the array representing the children of the node i. When the search query is called, the GPU kernel creates threads to execute the tasks. Then copy two data arrays to memory on device. Finally the main function in GPU is called. The algorithm is shown in Figure 3. The result is copied back to CPU when the execution on GPU is finished. Clear memory array (in parallel). For each thread if Search[i] is: For each search[i] overlaps with the query MBR node j: If the child node j is a leaf, mark it as part of the output. If the child node j is not a leaf, mark it in the Next Search array. Sync Threads Copy next Search array into Search[i] (in parallel).
Figure 3 R-Tree Searches on GPU

The work [6] presents a spatial join with R-Tree that improves the execution time. However, this algorithm is designed for a single-core processor. The proposed algorithm is based on this work but the implementation is on Graphics Processing Units. B. Parallel spatial join with R-Tree To reduce the execution time of a spatial join, a parallel algorithm can be employed. The work in [7] describes a parallel algorithm for a spatial join. A spatial join has two steps: filter step and refinement step. The filter step uses an approximation of the spatial objects, e.g. the minimum bounding rectangle (MBR). The filter admits only objects that are possible to satisfy the predicate. A spatial object is defined in the form {MBRi,IDi} where i is a key-pointer data for the object. The output of this step is the set [{MBRi,IDi},{MBRj,IDj}] if MBRi intersects with MBRj. Each pair is called a candidate pair. The next step is the refinement step. Pair of candidate objects is retrieved from the disk for performing a join predicate. To retrieve data, it reads the pointers from IDi and IDj. The algorithm creates tasks for testing an overlap in the filter step in parallel. For example in Figure 2, R and S denote spatial relations. The set {R1,R2,R3,R4,R5,R6,,RN} is in R root and the set {S1,S2,S3,S4,S5,S6,,SN} is in S root. In the algorithm described here the filter step is done in parallel.
Figure 2 Filter task creation and distribution in parallel for R-Tree join. R root = {R1, R2, R3, R4, R5}, S root = {S1, S2, S3, S4}; tasks created: Task1(R1,S1), Task2(R1,S2), Task3(R1,S3), Task4(R1,S4), ..., TaskN(RN,SN).

III. IMPLEMENTATION

A. Overview of the algorithm

Most works have focused on the improvement of the filter step. The first filter step assumes that the computation is done with the MBRs of the spatial objects. In this paper, this step is performed on the CPU and the data set is assumed to be in the data arrays. The algorithm begins by filtering objects in parallel on the GPU. The steps of the algorithm are as follows. Step 1: The data arrays required for the R-Tree are mapped to the device memory. The data arrays are prepared on the CPU before sending them to the device.
Step 2: Filtering step. A function to find an overlap between two MBR objects is called. Threads are created on the GPU for execution in parallel. The results are the set of MBRs which are overlapping. Step 3: Find leaf nodes. The results of step 2, the set of MBRs, are checked to determine whether they are leaf nodes or not. If they are leaf nodes, the set is returned as the result and sent to the host. If they are not leaf nodes, then they are used as input again recursively until leaf nodes are reached.

B. Data Structure in the algorithm

Assume MBR objects are stored in a table or a file. In the join operation, there are two relations, denoted R and S. The MBR structures (shown in C language syntax) are of the form:
struct MBR_object {
    int min_x, max_x, min_y, max_y;   /* x, y coordinates of the object's rectangle   */
};

struct MBR_root {
    int min_x, max_x, min_y, max_y;   /* x, y coordinates of the root's rectangle     */
    int child[numberOfchild];         /* children covered by this root                */
};

struct MBR_root rootR[numberOfrootR];
struct MBR_root rootS[numberOfrootS];       /* arrays of roots of relations R and S   */
struct MBR_object objectR[numberOfobjectR];
struct MBR_object objectS[numberOfobjectS]; /* arrays of objects of relations R and S */

An example is shown in Figure 4. It has five rectangles of objects. The objects are ordered according to the x-coordinate of the rectangle. The sorted list is {A, D, B, E, C}. Define the number of objects per pack as three. The assignments of objects into packs are: Pack1 = {A, D, B}, Pack2 = {C, E}. In the next step, a root is created for each pack by computing its min x, min y and max x, max y.

Figure 5 MBRs after split node R-Tree

The root node of pack1 is R1 and the root node of pack2 is R2. R1 points to three objects: A, D and B. R2 points to two objects: C and E. The root coordinates are computed from the min x, min y, max x, max y of all objects which the root covers. In the example, only one relation is shown. R-Tree creation is done on the CPU. The difference is in the spatial join operation: the spatial join on the CPU is sequential and on the GPU it is parallel.

D. Spatial join on GPU

To parallelize a spatial join, the data preparation is carried out on the CPU, such as MBR calculation and splitting R-Tree nodes. On the GPU, the overlap function and the intersection join function are executed in parallel.

1) Algorithm Overlap: This step is the filter step for testing the overlap between root nodes R and S.
1. Load the MBR data arrays (R and S) to the GPU.
2. Test the overlap of Ri and Sj in parallel.
3. The overlap function call is: Overlap((Sj.x_min < Ri.x_max) and (Sj.x_max > Ri.x_min) and (Sj.y_min < Ri.y_max) and (Sj.y_max > Ri.y_min))
4. For each Ri overlapping Sj,
5. find the children nodes of Ri and Sj.

Find children: Find the children nodes which are covered by the roots Ri and Sj.
a) The information from the MBRs indicates the children that are covered by the root.
b) Load the children data and send them to the overlap function.

C. R-Tree Indexing

An R-Tree is similar to a B-Tree, in which the index is recorded in a leaf node that points to the data object [4]. All minimum bounding rectangles are created from the x, y coordinates of objects. The index of the data is created by the packing R-Tree technique [10]. The technique is divided into three steps, illustrated by the sketch after the list:
1) Find the number of objects per pack. The number of children is between a lower bound (m) and an upper bound (M).
2) Sort the data on the x or y coordinates of the rectangles.
3) Assign rectangles from the sorted list to the pack successively, until the pack is full. Then, find min x, y and max x, y for each pack to create the root node.
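The following minimal C sketch illustrates the packing just described under simplifying assumptions of ours (objects per pack fixed at M = 3, sorting on min x only, coordinates invented for illustration); it is not the paper's implementation.

#include <stdio.h>
#include <stdlib.h>

#define M 3                      /* objects per pack (upper bound), assumed here */

typedef struct { int min_x, max_x, int_pad, min_y, max_y; } RectPadded; /* unused */
typedef struct { int min_x, max_x, min_y, max_y; } Rect;

static int by_min_x(const void *a, const void *b) {
    return ((const Rect *)a)->min_x - ((const Rect *)b)->min_x;
}

static int min2(int a, int b) { return a < b ? a : b; }
static int max2(int a, int b) { return a > b ? a : b; }

int main(void) {
    /* Five object rectangles (coordinates invented for illustration). */
    Rect obj[5] = { {0,2,0,2}, {4,6,1,3}, {8,10,0,2}, {1,3,3,5}, {5,7,4,6} };
    int n = 5, i, p;

    qsort(obj, n, sizeof(Rect), by_min_x);          /* step 2: sort on x  */

    for (p = 0; p * M < n; p++) {                   /* step 3: fill packs */
        Rect root = obj[p * M];                     /* pack MBR = root    */
        for (i = p * M + 1; i < n && i < (p + 1) * M; i++) {
            root.min_x = min2(root.min_x, obj[i].min_x);
            root.min_y = min2(root.min_y, obj[i].min_y);
            root.max_x = max2(root.max_x, obj[i].max_x);
            root.max_y = max2(root.max_y, obj[i].max_y);
        }
        printf("pack %d root MBR: x[%d,%d] y[%d,%d]\n",
               p + 1, root.min_x, root.max_x, root.min_y, root.max_y);
    }
    return 0;
}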

Figure 4 MBRs before split node R-Tree

Test intersection: This is the refinement step. Compute the join predicate on all children of Ri and Sj using the overlap function above.

2) GPU Program Structure: CUDA C language is used. The language has been designed to facilitate computation on Graphics Processing Units. A CUDA program has two phases [11]. In the first phase, the program on the CPU, called host code, performs the data initialization and transfers data from host to device or from device to host. In the second phase, the program on the GPU, called the device code, makes use of the CUDA runtime system to generate threads for execution of functions. All threads execute the same code but operate on different data at the same time. A CUDA function uses the keyword __global__ to define a kernel function. When the kernel function is called from the host, CUDA generates a grid of threads on the device. In the spatial join, the overlap function is distributed to different blocks and is executed at the same time with different data objects. To divide the task, every block has a block identity called blockIdx. For example:
Objects: Relation R = {Robject0, Robject1, Robject2, ..., RobjectN}, Relation S = {Sobject0, Sobject1, Sobject2, ..., SobjectN}
Overlap function: Compare all objects. Find the x and y coordinates in the intersection predicate. The sequential program on the CPU executes only one pair of data at a time:
Robject0 compare Sobject0
Robject0 compare Sobject1
Robject0 compare Sobject2
...
RobjectN compare SobjectN (time N)
On the GPU, the CUDA code on the device generates blocks for executing all the data on different blocks:
Block0 = Robject0 compare Sobject0
Block1 = Robject0 compare Sobject1
Block2 = Robject0 compare Sobject2
...
BlockN = RobjectN compare SobjectN
Memory is allocated for execution between the CPU and GPU. First, allocate memory for the data structures of the R-Tree roots and the MBRs of objects. Second, allocate memory for the data arrays to store results. When the task is done, copy the data arrays back to the host. The nested loop is transformed to run in parallel. The rectangles of objects are mapped to 2D blocks on the GPU. The outer loop is mapped to blockIdx.x and the inner loop is mapped to threadIdx.y. The call to the kernel function is: kernel<<<number of outer loop, number of inner loop>>>. The CUDA kernel generates blocks and threads for execution.
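As an illustration of this structure, the following CUDA C sketch of ours shows a pairwise overlap kernel and its host-side call; it is not the implementation measured in Section V. For simplicity the inner loop is mapped to threadIdx.x rather than threadIdx.y, and the number of S objects is assumed not to exceed the maximum number of threads per block.

#include <stdio.h>
#include <cuda_runtime.h>

struct MBR { int min_x, max_x, min_y, max_y; };

__global__ void overlap_kernel(const MBR *r, const MBR *s, int ns, int *hit)
{
    int i = blockIdx.x;            /* outer loop: one block per R object  */
    int j = threadIdx.x;           /* inner loop: one thread per S object */
    if (j < ns) {
        MBR a = r[i], b = s[j];
        hit[i * ns + j] = (b.min_x < a.max_x) && (b.max_x > a.min_x) &&
                          (b.min_y < a.max_y) && (b.max_y > a.min_y);
    }
}

int main(void)
{
    /* Two toy MBRs per relation (values invented for illustration). */
    MBR hr[2] = { {0, 4, 0, 4}, {10, 14, 10, 14} };
    MBR hs[2] = { {3, 6, 3, 6}, {20, 22, 20, 22} };
    int nr = 2, ns = 2, hhit[4];

    MBR *dr, *ds; int *dhit;
    cudaMalloc(&dr, nr * sizeof(MBR));
    cudaMalloc(&ds, ns * sizeof(MBR));
    cudaMalloc(&dhit, nr * ns * sizeof(int));
    cudaMemcpy(dr, hr, nr * sizeof(MBR), cudaMemcpyHostToDevice);
    cudaMemcpy(ds, hs, ns * sizeof(MBR), cudaMemcpyHostToDevice);

    overlap_kernel<<<nr, ns>>>(dr, ds, ns, dhit);   /* <<<outer, inner>>> */

    cudaMemcpy(hhit, dhit, nr * ns * sizeof(int), cudaMemcpyDeviceToHost);
    printf("R0-S0 overlap: %d, R1-S1 overlap: %d\n", hhit[0], hhit[3]);

    cudaFree(dr); cudaFree(ds); cudaFree(dhit);
    return 0;
}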

IV. EXPERIMENTATION

A. Platform

The spatial join is coded in the C language for the sequential version. The CUDA C language is used in the parallel version. Both versions run on an Intel Core i3 2.93 GHz with 2048 MB of DDR3 memory and an NVIDIA GT440 GPU (1092 MHz, 1024 MB, 96 CUDA cores).

B. Dataset

In the experiment, the dataset is retrieved from the R-Tree portal [12]. In the data preparation step the minimum bounding rectangles are pre-computed. The datasets consist of Roads join Rivers in Greece and Streets join Real roads in Germany.
TABLE I DATASET IN EXPERIMENTATION

Pair of dataset                     Amount MBRs   Data size
Greece: Rivers join Roads           47,918        0.7 MB
Germany: Streets join Real roads    67,008        0.6 MB

Table 1 shows the number of MBRs and the size of each dataset. All datasets are in text files. A C function is used to read data from a text file into data arrays.

V. RESULT

The spatial join is tested with the datasets in Table 1 with two functions (the overlap function on root nodes and the intersection function on children nodes). In the experiment, the time to read data from the text files and store them in data arrays is ignored. The execution time of the spatial join operation on the CPU and on the GPU is compared. The generation of the R-Tree is done on the CPU in both the sequential and the parallel version. Only the spatial join operations are different.

A. Performance comparison between sequential and parallel

The results are divided into two functions: overlap and intersect.
TABLE II EXECUTION TIME ON GPU AND CPU

                                    Overlap (ms)    Intersection (ms)   Total (ms)
Pair of dataset                     CPU     GPU     CPU      GPU        CPU      GPU
Greece: Rivers join Roads           18      4       72.67    22.33      90.67    26.33
Germany: Streets join Real roads    5.33    4.00    74.00    39.67      79.33    43.67
The result in Table 2 shows that the execution time on GPU is faster than on CPU. For the dataset 1, the overlap function on GPU is 77.78% faster (4 ms versus 18 ms or about 4x); the intersection function is 69.27% faster (3x). The total execution time on GPU is 70.96% faster (3.4x). For the dataset 2, the overlap function on GPU is 25% faster (1.3x); the intersection function is 46.40%
faster (1.8x). The total execution time on GPU is 44.96% faster (1.8x). The speedup depends on the data type as well. If data has larger numbers, the execution time is longer too. In the experiment, the dataset 1 is floating point data. It has six digits per one element. Execution time is higher than the dataset 2 because the dataset 2 has integer data. It has four digits per one element. The time to transfer data is significant. The data transfer time affected the execution time. The total running time in Table 2 includes the data transfer time from host to device and device to host.

The future work will be on how to automate and coordinate the task between CPU and GPU. There are other database management functions that are suitable to be implemented on a GPU too. It is worth the investigation as GPUs become ubiquitous nowadays.

REFERENCES
[1] NVIDIA CUDA Programming Guide, 2010. Retrieved from http://developer.download.nvidia.com
[2] A. Nanopoulos, A. N. Papadopoulos, Y. Theodoridis and Y. Manolopoulos, R-trees: Theory and Applications, Springer, 2006.
[3] Xiang Xiao and Tuo Shi, "R-Tree: A Hardware Implementation," Int. Conf. on Computer Design, Las Vegas, USA, July 14-17, 2008.
[4] A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching," ACM SIGMOD Int. Conf., 1984.
[5] E. H. Jacox and H. Samet, "Spatial Join Techniques," ACM Trans. on Database Systems, Vol. V, No. N, November 2006, pages 1-45.
[6] T. Brinkhoff, H. P. Kriegel and B. Seeger, "Efficient Processing of Spatial Joins Using R-trees," SIGMOD Conference, 1993, pp. 237-246.
[7] L. Mutenda and M. Kitsuregawa, "Parallel R-tree Spatial Join for a Shared-Nothing Architecture," Int. Sym. on Database Applications in Non-Traditional Environments, Japan, 1999, pp. 423-430.
[8] H. Wei, Z. Wei, Q. Yin, "A New Parallel Spatial Query Algorithm for Distributed Spatial Database," Int. Conf. on Machine Learning and Cybernetics, 2008, Vol. 3, pp. 1570-1574.
[9] M. Kunjir and A. Manthramurthy, "Using Graphics Processing in Spatial Indexing Algorithm," Research report, Indian Institute of Science, 2009.
[10] I. Kamel and C. Faloutsos, "On Packing R-trees," Int. Conf. on Information and Knowledge Management, ACM, USA, 1993, pp. 490-499.
[11] David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.
[12] R-tree Portal. [Online]. http://www.rtreeportal.org

Figure 6 Transfer rate dataset 1, dataset 2

Figure 6 shows the data transfer rate on GPU. The dataset 1 has 47,918 records and its size is 0.7 MB. The data transfer time of this dataset is 59.53% of the execution time. The dataset 2 has 67,008 records and is 0.6 MB. The data transfer time of this dataset is 76.83% of the execution time. VI. CONCLUSION This paper describes how a spatial join operation with R-Tree can be implemented on GPU. It uses the multiprocessing units in GPU to accelerate the computation. The process starts with splitting objects and indexing data in R-Tree on the host (CPU) and copies them to the device (GPU). The spatial join makes use of the parallel execution of functions to perform the calculation over many processing units in GPU. However using Graphic Processor Unit to perform general purpose task has limitations. The symbiosis between CPU and GPU is complicate. There is a need to transfer data back and forth between CPU and GPU and the data transfer time is significant. Therefore, it may be the case that the data transfer time will dominate the total execution time if the task and the data are not carefully divided.

Ontology Driven Conceptual Graph Representation of Natural Language


Supriyo Ghosh
Department of Information Technology National Institute of Technology,Durgapur Durgapur, West Bengal,713209,India Email:ghosh.supriyo.cse@gmail.com

Prajna Devi Upadhyay


Department of Information Technology National Institute of Technology,Durgapur Durgapur, West Bengal,713209,India Email: kirtu26@gmail.com

Animesh Dutta
Department of Information Technology National Institute of Technology,Durgapur Durgapur, West Bengal,713209,India Email: animeshrec@gmail.com
Abstract: In this paper we propose a methodology to convert a sentence of natural language to a conceptual graph, which is a graph representation for logic based on the semantic networks of Artificial Intelligence and the existential graph. A human being can express the same meaning in different forms of sentences. Although many natural language interfaces (NLIs) have been developed, they are domain specific and require a huge customization for each new domain. With our approach a casual user can get a more flexible interface to communicate with a computer, and less customization is required to shift from one domain to another. Firstly, a parsing tree is generated from the input sentence. From the parsing tree, each lexeme of the sentence is found and the basic concepts matching with the ontology are sorted out. Then the relationship between them is found by consulting the domain ontology and finally the conceptual graph is built.

A conceptual graph is a bipartite graph of concept vertices alternating with (conceptual) relation vertices, where edges connect relation vertices to concept vertices [4]. Each concept vertex, drawn as a box and labelled by a pair of a concept type and a concept referent, represents an entity whose type and referent are respectively defined by the concept type and the concept referent in the pair. Each relation vertex, drawn as a circle and labelled by a relation type, represents a relation of the entities represented by the concept vertices connected to it. Concepts connected to a relation are called neighbour concepts of the relation.

B. Ontology

An ontology [5] is a conceptualization of an application domain in a human-understandable and machine-readable form. It is used to reason about the properties of that domain and may be used to define that domain. As per the definition of ontology, an ontology defines basic terms and relations comprising the vocabulary of a topic area as well as the rules for combining terms and relations to define extensions to the vocabulary [6], [7]. A survey of Web tools [8] presented that extraction ontologies provide resilience and scalability natively, whereas in other approaches for information extraction the problem of resilience and scalability still remains. One serious difficulty in creating an ontology manually is the need for a lot of time and effort, and it might contain errors. Also, it requires a high degree of knowledge in both database theory and Perl regular-expression syntax. Professional groups are building metadata vocabularies or ontologies. Large hand-built ontologies exist, for example for medical and geographic terminology. Researchers are rapidly working to build systems that automate extracting them from huge volumes of text. One more complex problem is that no formalized rule exists to define and build an ontology. In our work we have assumed that all lexemes like nouns, verbs and adjectives of our experimental domain are defined as a concept or an instance of a concept in the domain ontology.
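As a concrete illustration of this bipartite structure, the following minimal C sketch shows one possible in-memory representation of a conceptual graph; the type names, fields and the relation labels used (Agent, Destination, Instrument) are our own illustrative assumptions, not a data structure taken from the cited works.

#include <stdio.h>

/* A concept vertex: a (type, referent) pair, e.g. [Person: John]. */
typedef struct {
    const char *type;
    const char *referent;        /* may be NULL for a generic concept */
} Concept;

/* A relation vertex: a relation type plus the indices of its neighbour concepts. */
typedef struct {
    const char *type;
    int arg1, arg2;              /* indices into the concept array */
} Relation;

int main(void) {
    /* "John is going to Boston by bus" as a small conceptual graph. */
    Concept concepts[] = { {"Person", "John"}, {"Go", NULL},
                           {"City", "Boston"}, {"Bus", NULL} };
    Relation relations[] = { {"Agent", 1, 0}, {"Destination", 1, 2},
                             {"Instrument", 1, 3} };
    int i, nr = 3;

    for (i = 0; i < nr; i++)
        printf("[%s] -> (%s) -> [%s]\n",
               concepts[relations[i].arg1].type, relations[i].type,
               concepts[relations[i].arg2].type);
    return 0;
}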

I. INTRODUCTION

Nowadays it is a challenging task to develop a methodology by which a human being can communicate with a computer. A human can communicate only by natural language, but a computer can understand a formalized data structure like a conceptual graph. So, both can communicate with proper semantics if they share a common vocabulary or ontology and there exists a proper interface which can convert the natural language into a formalized data structure like a conceptual graph and vice versa.

A. Conceptual Graph

A conceptual graph (CG) [1] is a graph representation for logic based on the semantic networks of artificial intelligence and the existential graphs of Charles Sanders Peirce [2]. Many versions of conceptual graphs have been designed and implemented over the last thirty years. In the first published paper on CGs, [3] used them to represent the conceptual schemas used in database systems. The first book on CGs [1] applied them to a wide range of topics in artificial intelligence and computer science. In [3] a version of conceptual graphs (CGs) was developed as an intermediate language for mapping natural language to a relational database.

This paper is structured as follows. The Related Work and Scope of the Work are presented in section II and III. Our proposed system overview and demonstration through examples is shown in section IV. Case study for different sentences are given in section V. Finally, we conclude and draw future directions in Section VI.

III. S COPE OF W ORK Now a days people are trying to use the computer or a software tool like agent for task delegation. So, language of the human being must be converted into some formal data structure which can be understood by the computer. A number of methodologies have been developed to convert a natural language into conceptual graph. But the semantics of the language cannot be dened by the conceptual graph, until we use a common vocabulary between user and computer. This common vocabulary can be expressed in the form of ontology. A casual user can dene a single sentence in various ways though all of these have same semantic. So our approach is to present a methodology which can convert a natural language into corresponding conceptual graph by consulting with its domain ontology. So both, user and computer can understand the semantic of the conversation. Our approach builds the unique conceptual graph for various sentences in different forms but same semantics. Similarly a single word has a number of synonyms and all synonyms may not be dened in domain ontology. So if a particular concept cannot be found in ontology our system must check whether any of its synonym is dened in the domain ontology as a concept. The synonym must be identied from WordNet [19] ontology. IV. S YSTEM OVERVIEW In this work we develop a methodology by which from a natural language query or sentence, a conceptual graph is generated using the dened concept and relationship between the concepts of the domain ontology. Using this approach a casual user and computer can communicate, if they share a common ontology or vocabulary. We develop the methodology of converting a sentence into conceptual graph by the following four steps: 1. Grammer for accepting natural language. 2. Parsing tree generation. 3. Recognizing the ontological concepts. 4. Creating Conceptual Graph by using the ontological concepts. A. Grammer for accepting natural language In this section we have dened a grammer which can recognize simple,compound and complex sentences. This grammer restricts the user to give the input sentence in a correct grammatical format. We dene a context free grammer G where G = (VN , , P, S) where, Non-Terminal: VN ={S,VP,NP,PP,V,AV,NN,P,ADJ,CONJ,DW,D} Here, S=Sentence, VP=Verb Phrase, NP=Noun Phrase, PP=Preposition Phrase, AV=Auxiliary Verb, V=Verb, NN=Noun, P=Preposition, ADJ=Adjective, CONJ=Conjunction, D=Determiner, DW=Depending words for complex sentence.

II. R ELATED W ORK A lot of methodologies have been developed to capture the meaning of a sentence by converting the natural language into its corresponding conceptual graph. But as there is no formalized rule build an ontology, it is still a challenging work to convert the whole set of natural language into a formalized machine understandable language. In [9], the authors have built the conceptual graph from natural language. But they have not dened any grammer or rule to generate parsing tree of any sentence. They also have not provided any idea to keep the same semantics by building unique conceptual graph for different sentences with same meaning. In [10], [11], [12], the authors have proposed a methodology to develop a tool to overcome the negetive effects of paraphrases, by converting a complex formed sentence into a simple format. In this approach a complex format query of the domain of interest which cannot be recognized by the system, is rearranged into a simple machine understandable format. But they lack of converting different forms of sentence into a single data structure like conceptual graph by consulting with its domain ontology. In [13], [4], the authors have built a query conceptual graph from a natural sentence by identifying the concept which needs a high computational cost. As they have not parsed the sentence, the searching cost of proper ontological concept is very high. In [14], the authors approach for building conceptual graph from natural language depend on only the semantics of verbs, which is not feasible for all the cases. In many existing ontologies, nouns and verbs both perform a very important role to capture the semantic of the sentence. In [15], [16], a natural language query is converted into SPARQL query language and by consulting with its domain ontology, the system generates answer. But this approach cannot capture the semantic of question always as the system does not consult the ontology concepts when SPARQL query is built. The work presented in [17], [18] is related to our proposed approach. Here after syntactic parsing of sentence system generates the ontological concepts. For unrecognized concepts system generates some suggession and from the users selection system learns about the ranking of the suggession in future. But this approach does not give us any notion of building same conceptual graph from various forms of sentences with same semantic.

Terminals: Σ = any kind of valid English word, like noun, verb, auxiliary verb, adjective, preposition, determiner, and the null symbol (ε) also.
Production Rules: P is the set of productions of the grammar. Every sentence recognized by this grammar must follow these production rules. P consists of:
1) For simple sentences:
S → <NP><VP>
VP → <V><NP>
V → <AV><V><P> | <V><CONJ><V>
NP → <D><ADJ><NN><PP>
PP → <P><NP>
V → verb | ε
NN → noun | ε
AV → auxiliary verb | ε
P → preposition | ε
ADJ → adjective | ε
D → determiner | ε
2) For compound sentences:
S → <S><CONJ><S>
as a compound sentence is built by joining two simple sentences using a conjunction.
3) For complex sentences:
S → <S><DW><S> | <DW><S><,><S>
as a complex sentence is formed by an independent simple sentence and a dependent sentence, where every dependent sentence starts with a dependent word. Now a complex sentence is formed in two ways. 1. If the dependent sentence comes first, then the sentence starts with a dependent word and the two sentences must be separated by a comma (,). 2. If the dependent sentence comes last, then the dependent word must separate the two sentences.
In the case of both complex and compound sentences, the individual simple sentence (S) follows all the production rules of the simple sentence.
Start symbol: A grammar has a single non-terminal (the start symbol) from which all sentences are derived. All sentences are derived from S by successive replacement using the productions of the grammar.
Null symbol: it is sometimes useful to specify that a symbol can be replaced by nothing at all. To indicate this, we use the null symbol ε, e.g., A → B | ε. In our defined grammar any non-terminal symbol except S, VP and NP has a null production.

B. Parsing Tree Generation

Whenever a user gives any sentence as an input to the system, a parsing tree is generated by using our defined grammar. From this parsing tree we can recognize the nouns, verbs, adjectives and prepositions of the given input sentence. For example, if the input sentence is "John is going to Boston by bus", a parsing tree is generated like Figure 1 by these production rules:

Fig. 1. Parse tree for simple sentence
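Since the system later walks such a tree to pick out the nouns, verbs and adjectives, the following minimal C sketch shows one possible parse-tree node representation; the type and field names are our own illustrative assumptions, not part of the paper.

#include <stdio.h>

#define MAX_CHILDREN 6

/* A node of the parse tree: a grammar symbol (S, NP, VP, NN, V, ...) plus,
   for leaves, the lexeme taken from the input sentence. */
typedef struct ParseNode {
    const char *symbol;                      /* non-terminal or POS tag */
    const char *lexeme;                      /* NULL for internal nodes */
    struct ParseNode *child[MAX_CHILDREN];
    int nchild;
} ParseNode;

/* Print the leaves of the tree together with their POS tags,
   which is the information used to identify ontological concepts. */
static void print_leaves(const ParseNode *n) {
    int i;
    if (n->nchild == 0 && n->lexeme != NULL)
        printf("%s/%s\n", n->lexeme, n->symbol);
    for (i = 0; i < n->nchild; i++)
        print_leaves(n->child[i]);
}

int main(void) {
    ParseNode john = { "NN", "John", {0}, 0 };
    ParseNode np   = { "NP", NULL, { &john }, 1 };
    print_leaves(&np);
    return 0;
}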

Fig. 2. Parse tree for compound sentence

S ⇒ <NP><VP>
S ⇒ <DET><ADJ><NN><PP><VP>
S ⇒ <NN><V><NP>
S ⇒ <NN><AV><V><P><NP>
S ⇒ John is going to <DET><ADJ><NN><PP>
S ⇒ John is going to <NN><P><NP>
S ⇒ John is going to Boston <P><DET><ADJ><NN><PP>
S ⇒ John is going to Boston by <NN>
S ⇒ John is going to Boston by bus
Now if a user gives a compound sentence as an input, the generation of the parse tree starts from the production of the compound sentence, shown as S → <S><CONJ><S>, since a compound sentence is built by joining two simple sentences using a conjunction. Therefore if a compound sentence like "The beautiful apple is red but it is rotten." comes as an input, the generated parse tree must be like Figure 2, where each simple sentence is generated by the production rules of the simple sentence. Now if a complex sentence comes as an input to the system, the generation of the parse tree follows the production rule of the complex sentence, where both the dependent and the independent sentence must follow the production rules of the simple sentence. The production starts from the basic production rule of the complex sentence, shown as: S → <S><DW><S> | <DW><S><,><S>. So, if a complex sentence like "After the student go to the class, he can give attendance." comes as an input to the system, the generated parsing tree must be like Figure 3. From this parsing tree we can easily identify the type of the sentence and each POS (part of speech) like nouns, verbs, adjectives, prepositions, determiners, etc. In our next


Fig. 3. Parse tree for complex sentence (After the student go to the class, he can give attendance).

Fig. 4. Flowchart for finding the ontological concepts of each lexeme of a sentence. Legend: OC = Ontology Concept, IC = Identified Concept, SC = Synonym of IC, IO = Instance or Individual of Ontology.

C. Recognizing the Ontological Concepts
This step involves finding the ontological concepts used in the given input sentence. Our general assumption is that each lexeme in the sentence is represented by a separate concept in the ontology; therefore all nouns, adjectives, verbs and pronouns are represented by identified concepts, while determiners, numbers, prepositions and conjunctions are used as referents of the relevant concepts. Here we have defined an algorithm (Figure 4) for finding the ontological concepts and instances used in a given input sentence by syntactically mapping each lexeme to the predefined ontological concepts. As we have assumed that the nouns, verbs and adjectives of a particular domain are defined as concepts in its domain ontology, we identify the nouns, verbs and adjectives from the parse tree and put them into the identified concept (IC) list. For each identified concept there are four cases:
1. The identified concept is identical to a domain ontology concept (OC).
2. The IC cannot be mapped to any OC, but a synonym of that IC is defined as an OC in the ontology.
3. The IC is defined as an instance or individual of a concept in the ontology.
4. The IC is not in the domain of the experiment, so it cannot be recognized by the domain ontology.
In the next step the identified concepts must be converted into ontological concepts, as the computer can understand only the vocabulary of the ontology. So, for each identified concept in the IC list, a different operation is performed for each of the four cases above:
1. As the IC is syntactically equal to an OC, this OC is added to the ontological concepts list.
2. If the IC cannot be syntactically mapped to any OC, the system checks all the synonyms of the IC from WordNet, where WordNet (Fellbaum, 1998) is an English lexical database containing about 120,000 entries of nouns, verbs, adjectives and adverbs, hierarchically organized in synonym groups (called synsets) and linked with relations such as hypernym, hyponym, holonym and others. For each synonym concept (SC) of the corresponding IC, the system tries to syntactically map the SC to an OC. If a syntactically mapped OC is found, that OC is added to the ontological concepts list.
3. If the IC is an instance or individual of an ontological concept, the system finds the corresponding OC of the instance and adds that OC to the ontological concepts list.
4. If the IC is not in the domain of the experiment, the system continues the loop with the next IC from the identified concepts list.
After obtaining the ontological concepts list, the system builds the conceptual graph as described in the next section.
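The four cases can be sketched in a few lines of Python; this is not the authors' implementation, and the concept and instance sets below are hypothetical stand-ins for the lookups that would actually be made against the OWL domain ontology (WordNet access assumes the NLTK wordnet corpus has been downloaded).

from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

# Hypothetical stand-ins for the domain ontology lookups.
ONTOLOGY_CONCEPTS = {"person", "apple", "color", "country", "exam", "miss", "forget", "time", "happy", "lucky"}
ONTOLOGY_INSTANCES = {"john": "person", "ram": "person", "red": "color", "india": "country"}

def to_ontological_concept(ic):
    """Map one identified concept (IC) to an ontological concept (OC) following the four cases above."""
    ic = ic.lower()
    if ic in ONTOLOGY_CONCEPTS:                      # case 1: IC is itself an OC
        return ic
    if ic in ONTOLOGY_INSTANCES:                     # case 3: IC is an instance of an OC
        return ONTOLOGY_INSTANCES[ic] + ":" + ic
    for syn in wn.synsets(ic):                       # case 2: try WordNet synonyms (SC)
        for lemma in syn.lemma_names():
            if lemma.lower() in ONTOLOGY_CONCEPTS:
                return lemma.lower()
    return None                                      # case 4: outside the domain ontology

print([to_ontological_concept(w) for w in ["John", "test", "apple", "galaxy"]])
# with WordNet data installed this should give something like ['person:john', 'exam', 'apple', None]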


D. Creating Conceptual Graph from Ontological Concepts List
In this section we propose an algorithm for generating a conceptual graph from the generated ontological concepts list, which consists of the following four steps:
Step 1: If the same concept occurs twice in the concept list with the same instance name, or with no instance, we keep one concept with its instance name and discard the other. But if the same concept occurs with different instance names, we keep both concepts. Consider the sentence "India is a large country" as input. After parsing the sentence, three concepts are added to the ontological concepts list: Country:india, Large and Country:*. The system then merges the two Country concepts into one and updates the ontological concepts list to Country:india and Large. But if a sentence "John is playing with Bob." comes as input, after parsing the sentence three concepts, Person:John, Play and Person:Bob, are added to the ontological concepts list. Although two of these concepts have the same name, Person, we keep both, as they have different instance names.


Fig. 5. Forming of subconceptual graphs: [CAT]→(Agent)→[Sitting] and [Sitting]→(Location)→[MAT].
Fig. 6. Forming of the desired final conceptual graph: [CAT]→(Agent)→[Sitting]→(Location)→[MAT].
Fig. 7. Conceptual graph for simple sentence: [APPLE]→(hasColor)→[Color:Red].

Step 2: As a conceptual graph consists of concepts and relationships between the concepts, the system finds, from the domain ontology, the exact relationship between each pair of consecutive concepts in the ontological concepts list. As an example, if a sentence "Cat is sitting on the mat." comes as input, parsing the sentence adds three concepts, CAT, Sitting and MAT, to the ontological concepts list. In this step the system finds the relationship between each pair of consecutive concepts from the domain ontology. Suppose the ontology defines the relationships as follows: Agent is the relationship between CAT and Sitting, and Location is the relationship between Sitting and MAT.
Step 3: Build a subconceptual graph for each consecutive pair of concepts in the ontological concepts list by connecting them with the relationship identified between those concepts in the ontology. So, in our previous example of "Cat is sitting on the mat", there are two pairs of consecutive concepts, and two subconceptual graphs are formed, as shown in Figure 5.
Step 4: Merge the subconceptual graphs on their common concept names to obtain the final desired conceptual graph, which can be recognized by any system that shares the common domain ontology. So, in the previous example of "Cat is sitting on the mat", the two subconceptual graphs are merged on their common concept Sitting to build the desired final conceptual graph shown in Figure 6.
V. CASE STUDY
In this work our basic goal is to develop a conceptual graph from a natural language sentence given by a casual user while keeping the actual semantics intact, as a computer can understand the semantics of a sentence represented as a conceptual graph. The main problem is that a single sentence can be expressed in various ways even though all of them have the same semantics, so the conceptual graph for every sentence with the same semantics must be identical. We present here some examples of this problem with three types of sentences; in each case the same conceptual graph is formed.
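Steps 2-4 can be made concrete with a small sketch; the relation table below is an illustrative stand-in for the relationships that the system would actually read from the domain ontology.

# Illustrative relation table; in the described system these relations come from the domain ontology.
RELATIONS = {("CAT", "Sitting"): "Agent", ("Sitting", "MAT"): "Location"}

def build_conceptual_graph(oc_list):
    """Steps 2-4: relate consecutive concepts, form subgraphs, and merge them on shared concepts."""
    edges = []
    for a, b in zip(oc_list, oc_list[1:]):                     # step 2: consecutive concept pairs
        rel = RELATIONS.get((a, b)) or RELATIONS.get((b, a))
        if rel:
            edges.append((a, rel, b))                          # step 3: one subconceptual graph per pair
    return edges                                               # step 4: shared names merge the subgraphs

print(build_conceptual_graph(["CAT", "Sitting", "MAT"]))
# [('CAT', 'Agent', 'Sitting'), ('Sitting', 'Location', 'MAT')]  -- the graph of Fig. 6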

1) Case 1: For simple sentences: A casual user can give a simple sentence with the same semantics in various forms:
1. The apple is red.
2. The color of the apple is red.
3. The apple is of red color.
These three simple sentences have the same semantics but different forms. After parsing the first sentence we identify two concepts, Apple and Red, where Red is an instance of the ontological concept Color and Apple is another ontological concept. So the OC list contains the concepts Apple and Color:red. For the second and third sentences the identified concepts are Color, Apple and Red, where Red is an instance of the Color concept. The concepts Color:* and Color:red must therefore be merged into a single concept, so we keep the concept Color:red and discard the Color:* concept. Finally the OC list contains the two concepts Apple and Color:red. Since the OC lists contain the same elements, the conceptual graph developed for the three sentences is also the same, as shown in Figure 7.
2) Case 2: For compound sentences: A casual user can give a compound sentence with the same semantics in different forms:
1. John is happy and he is lucky.
2. John is lucky and he is glad too.
These two compound sentences have the same semantics but different forms. After parsing the first sentence we identify two simple sentences joined by "and". From the first simple sentence we identify two concepts, John and Happy, where John is an instance of the ontological concept Person and Happy is another ontological concept. So the OC list contains the concepts Happy and Person:John. For the second simple sentence we identify two concepts: he, which represents John, an instance of the ontological concept Person, and another ontological concept, Lucky. After joining the concepts by their defined relationships, we get the conceptual graph shown in Figure 8, where the two individual conceptual graphs are joined by the conjunction AND. The second sentence is also a combination of two simple sentences joined by "and". For the first simple sentence, the identified concepts are John, an instance of the ontological concept Person, and another ontological concept, Lucky. For the second simple sentence we identify the concept he, which represents John, an instance of the ontological concept Person. But we cannot map the identified concept Glad to any ontological concept, so the system checks WordNet for synonyms of glad and finds that glad is identical to Happy, which is a domain ontological concept. From the ontological concepts list, the system then builds the conceptual graph, which is also identical to Figure 8.


Fig. 8. Conceptual graph for compound sentence: [Person:John]→(hasAttribute)→[Happy] AND [Person:John]→(hasAttribute)→[Lucky].
Fig. 9. Conceptual graph for complex sentence: [Person:Ram]→(agent)→[Forget]→(clock)→[Time] and [Person:Ram]→(agent)→[Miss]→(topic)→[Exam].

The dotted line represents that the two individual conceptual graphs are joined by the same ontological concept Person:John.
3) Case 3: For complex sentences: A casual user can give a complex sentence with the same semantics in various forms:
1. Because Ram forgot the time, he missed the test.
2. Ram missed the test as he forgot the time.
These two complex sentences have the same semantics but different forms. After parsing the first sentence, we identify two simple sentences: the first one dependent and the second one independent. From the dependent sentence we identify three concepts, Ram, Forget and Time, where Ram is an instance of the ontological concept Person, and Forget and Time are other ontological concepts. So the OC list contains the concepts Person:Ram, Forget and Time. For the independent simple sentence we identify three concepts: he, which represents Ram, an instance of the ontological concept Person, and the ontological concepts Miss and Exam. After joining the concepts by their defined relationships, we get the conceptual graph shown in Figure 9. The dotted line represents that the two individual conceptual graphs are joined by the same ontological concept Person:Ram. The second sentence is also a combination of two sentences: the first independent and the second dependent. For the independent simple sentence the identified concepts are Ram, an instance of the ontological concept Person, and the concepts Miss and Test. Miss is a defined ontological concept, but Test cannot be mapped to any concept, so the system checks the synonyms of Test in WordNet. It finds that a synonym of Test is Exam, which can be mapped to the ontological concept Exam, so Exam is added to the OC list. The final OC list contains the three concepts Person:Ram, Miss and Exam. For the dependent sentence we identify the concept he, which represents Ram, an instance of the ontological concept Person, while Forget and Time are two defined ontological concepts. From the ontological concepts list, the system builds the conceptual graph, which is also identical to Figure 9.
VI. CONCLUSION
In this work, we have defined a formal methodology to convert a natural language sentence into its corresponding

conceptual graph form by consulting its common domain ontology. Thus a casual user and a computer can interact with each other while keeping the semantics of the communication intact. However, this approach cannot deal with complex sentences properly. Issues related to complex sentences are how to break them into simple sentences and whether the simple sentences are causally related or not; if they are causally related, which sentence must be executed first. A future prospect of this work is therefore to define a formal method by which the problem of representing complex sentences can be overcome. Another prospect is to define a methodology which can deal with any kind of domain ontology, where all the verbs, nouns and adjectives of a particular domain may not be defined as a concept or an instance of a concept. In other words, a methodology has to be developed which is independent of how the ontology is defined.
ACKNOWLEDGMENT
We are really grateful to the Information Technology department of NIT, Durgapur for giving us a perfect environment and all the facilities to do this work.
REFERENCES
[1] Sowa, J. F.: Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, Reading (1984)
[2] F. V. Harmelen, V. Lifschitz, and B. Porter: Handbook of Knowledge Representation, Elsevier, 2008, pp. 213-237.
[3] Sowa, John F. (1976): Conceptual graphs for a database interface, IBM Journal of Research and Development 20:4, 336-357.
[4] Cao, T.H., Cao, T.D., Tran, T.L.: A Robust Ontology-Based Method for Translating Natural Language Queries to Conceptual Graphs, In: Domingue, J., Anutariya, C. (eds.) ASWC 2008. LNCS, vol. 5367, pp. 479-492. Springer, Heidelberg (2008)
[5] C. Snae and M. Brueckner: Ontology-Driven E-Learning System Based on Roles and Activities for Thai Learning Environment, In Interdisciplinary Journal of Knowledge and Learning Objects, Volume 3, 2007.
[6] Nicholas Gibbins, Stephen Harris, Nigel Shadbolt: Agent-based Semantic Web Services, May 20-24, 2003, Budapest, Hungary. ACM 2003.
[7] A. G. Perez, M. F. Lopez and O. Corcho: Ontological Engineering, Springer.
[8] Alberto H. F., Berthier A., Altigran S., Juliana S.: A Brief Survey of Web Data Extraction Tools, ACM SIGMOD Record, v.31 n.2, June 2002.
[9] Wael Salloum: A Question Answering System based on Conceptual Graph Formalism, IEEE Computer Society Press, New York, 2009.
[10] D. Moll and M. Van Zaanen: Learning of Graph Rules for Question Answering, Proc. ALTW05, Sydney, December 2005.
[11] F. R. James, J. Dowdall, K. Kaljur, M. Hess, and D. Moll: Exploiting Paraphrases in a Question Answering System, In Proc. Workshop in Paraphrasing at ACL2003, 2003.


[12] F. D. France, F. Yvon and O. Collin: Learning Paraphrases to Improve a Question-Answering System, In Proceedings of the 10th Conference of EACL Workshop Natural Language Processing for Question-Answering, 2003.
[13] Tru H. Cao and Anh H. Mai: Ontology-Based Understanding of Natural Language Queries Using Nested Conceptual Graphs, Lecture Notes in Computer Science, Springer-Verlag, 2010, Volume 6208/2010, pp. 70-83 (2010)
[14] Svetlana Hensman: Construction of Conceptual Graph Representation of Texts, In Proceedings of the Student Research Workshop at HLT-NAACL, Boston, 2004, pp. 49-54.
[15] Damljanovic, D., Tablan, V., Bontcheva, K.: A text-based query interface to OWL ontologies, In: 6th Language Resources and Evaluation Conference (LREC). ELRA, Marrakech (May 2008)
[16] Tablan, V., Damljanovic, D., Bontcheva, K.: A natural language query interface to structured information, In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 361-375. Springer, Heidelberg (2008)
[17] Danica Damljanovic, Milan Agatonovic, and Hamish Cunningham: Natural Language Interfaces to Ontologies: Combining Syntactic Analysis and Ontology-Based Lookup through the User Interaction, In: Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010). Lecture Notes in Computer Science, Springer-Verlag, Heraklion, Greece (June 2010)
[18] Damljanovic, D., Agatonovic, M., Cunningham, H.: Identification of the Question Focus: Combining Syntactic Analysis and Ontology-based Lookup through the User Interaction, In: 7th Language Resources and Evaluation Conference (LREC). ELRA, La Valletta (May 2010)
[19] George A. Miller: WordNet: An On-line Lexical Database, in the International Journal of Lexicography, Vol. 3, No. 4, 1990.


Web Services Privacy Measurement Based on Privacy Policy and Sensitivity Level of Personal Information
Punyaphat Chaiwongsa and Twittie Senivongse
Computer Science Program, Department of Computer Engineering Faculty of Engineering, Chulalongkorn University Bangkok, Thailand punyaphat.c@student.chula.ac.th, twittie.s@chula.ac.th

Abstract - Web services technology has been in the mainstream of today's software development. Software designers can select Web services with certain functionality and use or compose them in their applications with ease and flexibility. To distinguish between different services with similar functionality, the designers consider quality of service. Privacy is one aspect of quality that is largely addressed since services may require service users to reveal personal information. A service should respect the privacy of the users by requiring only the information that is necessary for its processing as well as handling personal information in a correct manner. This paper presents a privacy measurement model for service users to determine the privacy quality of a Web service. The model combines two aspects of privacy: it considers the degree of privacy principles compliance of the service as well as the sensitivity level of the user information which the service requires. A service which complies with the privacy principles and requires less sensitive information would be of high quality with regard to privacy. In addition, the service WSDL can be augmented with semantic annotation using SAWSDL. The annotation specifies the semantics of the user information required by the service, and this can help automate privacy measurement. We also present a measurement tool and an example of its application.

Keywords- privacy; privacy policy; measurement; Web services; personal information; ontology

I. INTRODUCTION

Web services technology has been in the mainstream of software development since it allows software designers to use Web services with certain functionality in their applications with ease and flexibility. Software designers study service information that is published on service providers' Web sites or through service directories and select the services that have the functionality required by the application requirements. For those with similar functionality, different aspects of quality of service (QoS) are usually considered to distinguish them. Privacy is one aspect of quality that is largely addressed since Web services may require service users to reveal personal information. An online shopping Web service may ask a user to give personal information such as name, address, phone number, and credit card number when buying products, and a student registration Web service of a university would also ask for students' personal information to maintain student records. A Web service should respect the privacy of service users by

requiring only the information that is necessary for its processing as well as handling personal information in a correct manner. From the view of a service user, proper handling of the disclosed personal information is highly expected. From the view of a software designer who is developing a service-based application, it is desirable to select a Web service with privacy quality into the application since the privacy quality of the service contributes to that of the application. The application itself should also respect the privacy of the application users. In this paper, we present a privacy measurement model for service users to determine the privacy quality of a Web service. The model combines two aspects of privacy: it considers the degree of privacy principles compliance of the service as well as the sensitivity level of the user information which the service requires. The model follows the approach by Yu et al. [1], which assesses whether the privacy policy of a Web service complies with a set of privacy principles. We enhance it by also considering the sensitivity level of users' personal information. The approach by Jang and Yoo [2] is adapted to determine the sensitivity level of the personal information that is exchanged with the service. According to our privacy measurement model, a service which complies with the privacy principles and requires less sensitive information would be of high quality with regard to privacy. In addition, we develop a supporting tool for the model. The tool relies on augmenting WSDL data elements of the service with semantic annotation using the SAWSDL mechanism [3]. The annotation specifies the meaning of WSDL data elements based on a personal information ontology, i.e., a semantic term associated with a data element indicates which personal information the data element represents. Semantic annotation is useful for disambiguating user information that may be named differently by different Web services. As a result, it helps automate privacy measurement and facilitates the comparison of privacy quality of different Web services. Combining these two aspects of privacy, the model is considered practical for service users since the assessment is based on the privacy policy and the service WSDL, which can be easily accessed. Section II of this paper discusses related work. Section III describes an assessment of the privacy policy of a Web service based on privacy principles and Section IV presents the measurement of the sensitivity level of personal information. The privacy measurement model combining these two aspects of


privacy is proposed in Section V. The supporting tool is described in Section VI and the paper concludes in Section VII. II. RELATED WORK

W3C has stated in the Web Services Architecture Requirements [4] that the Web services architecture must enable privacy protection for service consumers. Web services must express privacy policy statements which comply with the Platform for Privacy Preferences (P3P), and the policy statements must be accessible to service consumers. Service providers generally publish privacy policy statements which follow privacy protection guidelines proposed by governmental or international organizations, and these statements are the basis for privacy protection measurement.
A. Related Work in Privacy Measurement Based on Privacy Policy
Following the Canadian Standards Association Privacy Principles, Yee [5] specifies how to define a privacy policy, and a method to measure how well a service protects user privacy based on measurement of violations of the user's privacy policy. The work is extended to consider compliance between the E-service provider privacy policy and the user privacy policy using a privacy policy agreement checker [6]. Similarly, Xu et al. [7] provide, for a composite service and its user, a policy compliance checker which considers the sensitivity levels of personal data that flow in the service together with trust levels and data flow permissions given to the services in the composition. Tavakolan et al. [8] propose a model for privacy policy and a method to match and rank the privacy policies of different services against users' privacy requirements. We are particularly interested in the work by Yu et al. [1] which follows the 10 privacy principles defined in the Australia National Privacy Principles (Privacy Amendment Act 2000). The work proposes a checklist to rate the privacy protection of a Web service with regard to each privacy principle. A privacy policy checker which can be plugged into the Web service application is also developed to check for privacy principles compliance.
B. Related Work in Privacy Measurement Based on Sensitivity Level of Personal Information
Yu et al. [9] present a QoS model to derive privacy risk in service composition. The privacy risk is computed using the percentage of private data the users have to release to the services. The users can define weights that quantify the potential damage if the private data leak. Hewett and Kijsanayothin [10] propose privacy-aware service composition which finds an executable service chain that satisfies given composite service I/O requirements with a minimum number of services and minimum information leakage. To quantify information leakage, sensitivity levels are assigned to different types of personal information that flow in the composition. The composition also complies with users' privacy preferences and providers' trust. We are particularly interested in the comprehensive view of privacy sensitivity level of Jang and Yoo [2]. They address four factors of sensitivity, i.e. degree of conjunction, principle of identity, principle of privacy, and value of analogism. They also give a guideline to evaluate these sensitivity factors which we can adapt for our work.

III. ASSESSMENT OF WEB SERVICE PRIVACY POLICY

For the privacy policy aspect, we simply adopt the privacy principles compliance assessment by Yu et al. [1]. According to the Australia National Privacy Principles (Privacy Amendment Act 2000), there are 10 privacy principles for the proper management of personal information. For each principle, Yu et al. list a number of criteria to rate the privacy compliance of a service. For the full detail of the compliance checklist, see [1]. Here we show a small part of the checklist through our supporting tool in Fig. 1. For instance, there are 3 criteria that a service has to follow to comply with the collection principle, i.e., the privacy policy statements must state (1) the kind of data being collected, (2) the method of data collection, and (3) the purpose of data collection. The service user can check with the published privacy policy how many of these criteria the service satisfies, and then give the compliance rating score. Thus for the collection principle, the maximum rating is 3; the rating ranges between 0 and 3. The service user can also define a weighted score for each privacy principle denoting the relative importance of each principle. The total privacy principle compliance (Pcom) score of a service is computed by (1) [1]:

Pcom = Σ_{i=1}^{10} (ri / rimax) * pi    (1)

where ri = rating for principle i assessed by the service user, rimax = maximum rating for principle i, pi = weighted score for principle i assigned by the service user, and Σ_{i=1}^{10} pi = 100.

Pcom ranges between 0 and 100. Instead we will later use a normalized NPcom, as in (2), which ranges between 0 and 1, in our privacy measurement model in Section V:

NPcom = Pcom / 100 = Σ_{i=1}^{10} ((ri / rimax) * pi) / 100    (2)

As an example, a user of a Register service of a university, which registers student information, rates and gives a weight for each privacy principle as in Table I. Pcom of this service then is 87.08 and NPcom is 0.87.
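The computation in this example is straightforward; the following sketch (ours, not part of the paper's tool) reproduces the figures of Table I using equations (1) and (2).

# Ratings, maximum ratings and weights for the ten privacy principles (Table I).
r    = [2, 2, 2, 2, 2, 3, 2, 0, 2, 1]
rmax = [3, 2, 2, 2, 2, 4, 2, 1, 2, 1]
p    = [20, 10, 5, 10, 5, 5, 2, 5, 8, 30]    # weights summing to 100

pcom  = sum(ri / rmi * pi for ri, rmi, pi in zip(r, rmax, p))   # equation (1)
npcom = pcom / 100                                              # equation (2)
print(round(pcom, 2), round(npcom, 2))                          # 87.08 0.87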

Figure 1. Assessing privacy principles compliance using our tool.


TABLE I. EXAMPLE OF PRIVACY PRINCIPLES COMPLIANCE RATING
No. | Privacy Principle      | Rating ri | Max Rating rimax | Weight pi | Score ri/rimax*pi
1   | Collection             | 2         | 3                | 20        | 13.33
2   | Use and Disclosure     | 2         | 2                | 10        | 10
3   | Data Quality           | 2         | 2                | 5         | 5
4   | Data Security          | 2         | 2                | 10        | 10
5   | Openness               | 2         | 2                | 5         | 5
6   | Access and Correction  | 3         | 4                | 5         | 3.75
7   | Identifiers            | 2         | 2                | 2         | 2
8   | Anonymity              | 0         | 1                | 5         | 0
9   | Transborder Data Flows | 2         | 2                | 8         | 8
10  | Sensitive Information  | 1         | 1                | 30        | 30
Total weight = 100; Pcom = 87.08; NPcom = 0.87

IV. ASSESSMENT OF SENSITIVITY LEVEL OF PERSONAL INFORMATION

The motivation for assessing the sensitivity level of personal information is that, for different Web services with similar functionality, a service user would prefer one to which the disclosure of personal information is limited. It is therefore desirable that fewer personal data items are required by the service and that the data items that are required are also less sensitive. We adapt the approach by Jang and Yoo [2] which analyzes the sensitivity level of personal information based on personal information classification.
A. Formal Concept Analysis and Ontology of Personal Information
Jang and Yoo represent personal information classification using formal concept analysis (FCA) [11]. The formal definition of a data group, i.e., personal information in this case, is given as DG = (G, N, R) where G is a finite set of concepts and can be described as G = {g1, g2, ..., gn}, N is a finite set of attributes which describe the concepts and can be described as N = {n1, n2, ..., nm}, and R is a binary relation between G and N, i.e., R ⊆ G × N. For example, g1 R n1, or (g1, n1) ∈ R, represents that the concept g1 has the attribute n1. The formal concepts can also be described using a cross table. We extend the cross table of [2] to create the one shown in Table II. Here personal information is classified into 7 concepts, i.e., G = {Basic, Career, ..., Finance}, and there are 37 personal information attributes, i.e., N = {BirthPlace, BirthDay, ..., CreditcardNumber}. The cross table shows the relation, marked by an x, between each concept and the attributes of the concept. For example, BirthPlace belongs in the Basic and Private concepts while the Basic concept has 15 attributes, i.e., BirthPlace, BirthDay, ..., DrivingLicenseNumber. For a Web service, its WSDL interface document defines what personal information of the users is required for the processing of the service. However, different services with similar functionality may name the exchanged data elements differently. A service, for example, may require a data element called Address whereas another requires Addr. In order to infer that the two services require the same personal data, both the Address and Addr elements in the two WSDLs can be annotated with the same semantic information. To disambiguate user information that may be named differently by different services, we augment the WSDL data elements of a service with semantic annotation using the SAWSDL mechanism [3]. The annotation specifies the meaning of WSDL data elements based on a personal information ontology. We represent the personal information concepts and attributes in the cross table (Table II) as an OWL-based personal information ontology as in Fig. 2. The attribute sawsdl:modelReference is associated with a data element in the WSDL document to reference a semantic term in the ontology. In the WSDL of the Register service in Fig. 3, the meaning of the data element called Name is the term PersonName in the ontology in Fig. 2, etc. Semantic annotation is useful for automating privacy measurement and facilitates the comparison of privacy quality of different services.

TABLE II. CROSS TABLE OF PERSONAL INFORMATION, ADAPTED FROM [2]


Figure 2. Part of personal information ontology.

<xs:element name="RegisterRequest">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Name" type="xs:string" sawsdl:modelReference="http://localhost/ws/ontology/PI#PersonName"/>
      <xs:element name="Address" type="xs:string" sawsdl:modelReference="http://localhost/ws/ontology/PI#HomeAddress"/>
      <xs:element name="MobilephoneNo" type="xs:string" sawsdl:modelReference="http://localhost/ws/ontology/PI#CellphoneNumber"/>
      <xs:element name="Email" type="xs:string" sawsdl:modelReference="http://localhost/ws/ontology/PI#PersonalEmailAddress"/>
      <xs:element name="StdID" type="xs:string" sawsdl:modelReference="http://localhost/ws/ontology/PI#StudentID"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Figure 3. Part of a semantics-annotated WSDL document.

B. Sensitivity Level of Personal Information
Jang and Yoo [2] address four factors of privacy sensitivity for personal information, i.e. degree of conjunction, principle of identity, principle of privacy, and value of analogism. They also give a guideline to evaluate these sensitivity factors which we can adapt for our work. We define the formulas to compute the scores of these factors based on the cross table (Table II) as follows.
1) Degree of conjunction of an attribute (personal data item) n is derived from the number of concepts which the attribute n describes. This means n is associated with these concepts and the disclosure of n may lead to other information belonging in these concepts. The degree of conjunction of n, or DC(n), is determined by (3):

DC(n) = (number of concepts in which n belongs) / (total number of concepts)    (3)

For example, from Table II, PersonName is associated with 5 out of 7 concepts, i.e., Basic, Career, Health, School, and Finance. Therefore DC(PersonName) = 5/7.
2) Principle of identity of an attribute n indicates that n is an identity attribute of the concept with which it is associated, i.e., n is used as key information to access other attributes in that concept. Disclosure of n may then lead to more problems than disclosure of other attributes. The principle of identity of n, or IA(n), is determined by (4):

IA(n) = (number of attributes in the concept for which n is the identity attribute) / (total number of attributes), if n is an identity attribute; 0 if n is not an identity attribute    (4)

For example, from Table II, StudentID is an identity attribute (i.e., it belongs in the concept Identity) for the concept School. There are 10 attributes associated with School and there are 37 attributes in total. Therefore IA(StudentID) = 10/37. HomeAddress is not an identity attribute, so IA(HomeAddress) = 0.
3) Principle of privacy of an attribute n indicates that n is private information. Note that this is subjective to the service users, e.g., some users may consider Age as private information whereas others may not. We let the service users customize the cross table by specifying which attributes are considered private, i.e., belong in the concept Private. The principle of privacy of n, or PA(n), is determined by (5):

PA(n) = 1 if n belongs in the concept Private; 0 if n does not belong in the concept Private    (5)

For example, from Table II, CellphoneNumber is private and PA(CellphoneNumber) = 1, whereas PersonalEmailAddress is not and PA(PersonalEmailAddress) = 0.
4) Value of analogism of an attribute n indicates that n can be used to derive other attributes. This means the knowledge of n can also reveal other personal information. The value of analogism of n, or AA(n), is determined by (6):

AA(n) = 1 if n can derive other attributes; 0 if n cannot derive other attributes    (6)

The analogy between attributes has to be defined and associated with the cross table and the personal information ontology. For example, SocialSecurityNumber can derive other attributes such as BirthPlace, so AA(SocialSecurityNumber) = 1, whereas Age cannot and AA(Age) = 0.
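To illustrate the four factors, the sketch below encodes a small slice of the cross table as Python sets; it is our illustration, not the paper's tool, and since the full table has 7 concepts and 37 attributes, the numbers printed here differ from the paper's examples unless the complete table is supplied.

# A small, hypothetical slice of the cross table: concept -> attributes.
CROSS = {
    "Basic":   {"PersonName", "BirthPlace", "HomeAddress"},
    "School":  {"PersonName", "StudentID"},
    "Finance": {"PersonName", "CreditcardNumber"},
}
TOTAL_CONCEPTS, TOTAL_ATTRIBUTES = 7, 37           # sizes of the full cross table
IDENTITY = {"StudentID": "School"}                 # attribute -> concept it identifies
PRIVATE = {"CellphoneNumber", "BirthPlace"}        # attributes the user marks as private
DERIVES_OTHERS = {"SocialSecurityNumber"}          # attributes that can derive other attributes

def DC(n):   # equation (3)
    return sum(n in attrs for attrs in CROSS.values()) / TOTAL_CONCEPTS

def IA(n):   # equation (4)
    return len(CROSS[IDENTITY[n]]) / TOTAL_ATTRIBUTES if n in IDENTITY else 0.0

def PA(n):   # equation (5)
    return 1.0 if n in PRIVATE else 0.0

def AA(n):   # equation (6)
    return 1.0 if n in DERIVES_OTHERS else 0.0

print(DC("PersonName"), PA("CellphoneNumber"), AA("Age"))
# about 0.43 with this slice (5/7 with the full table), 1.0, 0.0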


All four sensitivity factor scores range between 0 and 1. Based on these scores, Jang and Yoo suggest that the sensitivity level of an attribute n, or SL(n), be determined by (7) [2]:

SL(n) = DC(n) + IA(n) + PA(n) + AA(n)    (7)

We propose to compute the sensitivity level of all personal information exchanged with a Web service using (8):

SLws = Σ_{i=1}^{k} SLi    (8)

where k = number of exchanged personal data elements and SLi = sensitivity level of personal data element i computed by (7). We will later use a normalized NSLws, as in (9), which ranges between 0 and 1, in our privacy measurement model in Section V:

NSLws = Σ_{i=1}^{k} SLi / (4k) = SLws / (4k)    (9)

As an example, suppose a Register service of a university requires the following personal information: Name, Address, MobilephoneNo, Email, and StdID. In the WSDL in Fig. 3, these data elements are annotated with semantic terms described in the personal information ontology in Fig. 2. We can determine the sensitivity level of each data element by calculating the sensitivity level of the associated semantic term using (7), and the total sensitivity level of all personal data required by the service using (8) and (9), as in Table III.

V. WEB SERVICES PRIVACY MEASUREMENT MODEL

We combine the two privacy aspects in Sections III and IV into a privacy measurement model. The normalized privacy principles compliance NPcom of a service is a positive aspect. A service user would prefer a service with a high compliance rating. The service provider is encouraged to follow the privacy principles, provide proper management of users' personal information, and publish a clear privacy policy that can facilitate compliance rating by the service users. On the contrary, the normalized sensitivity level NSLws of the service is a negative aspect. Using a service which exchanges highly sensitive personal data could mean a high risk of privacy violation if these data are disclosed or not protected properly. The privacy quality P of a service is computed by (10). The service user can also define weighted scores α and β to denote the relative importance of the two privacy aspects; α and β are in [0, 1] and α + β = 1. The service which complies with the privacy principles and requires less sensitive information would be of high quality with regard to privacy.

P = α * NPcom + β * (1 - NSLws)    (10)

TABLE III. EXAMPLE OF SENSITIVITY LEVEL MEASUREMENT
Data Element   | Semantic Annotation n | DC(n) (3) | IA(n) (4) | PA(n) (5) | AA(n) (6) | SL(n) (7)
Name           | PersonName            | 5/7       | 0         | 0         | 0         | 0.71
Address        | HomeAddress           | 1/7       | 0         | 0         | 0         | 0.14
MobilephoneNo  | CellphoneNumber       | 6/7       | 0         | 1         | 0         | 1.86
Email          | PersonalEmailAddress  | 3/7       | 0         | 0         | 0         | 0.43
StdID          | StudentID             | 2/7       | 10/37     | 0         | 0         | 0.56
Total: SLws = 3.7; NSLws = 3.7 / (4*5) = 0.19

As an example, given equal weights to the two privacy aspects and the assessment in Tables II and III, the privacy quality of the Register service is P = (0.5)(0.87) + (0.5)(1 - 0.19) = 0.435 + 0.405 = 0.84. The Register service has a high privacy principles compliance level and requires personal data that are relatively not so sensitive. It is therefore desirable in terms of privacy.

VI. DEVELOPMENT OF SUPPORTING TOOL

Besides the proposed model, we have developed a Web-based tool called a privacy measurement system to support the model. To be able to automate privacy measurement, the tool relies on the service WSDL being annotated with semantic terms described in the personal information ontology. The usage scenario of the privacy measurement system is depicted in Fig. 4 and can be described as follows. 1) The privacy measurement system obtains the cross table and the personal information ontology from a privacy domain expert. In the prototype of the tool, the cross table in Table II and a personal information ontology that corresponds to the cross table are used. 2) A service user specifies the Web service whose privacy is to be measured. Together with the service WSDL URL, the user uses the tool to specify the following: a) Privacy principles compliance rating ri and weight pi for each privacy principle; the user will have to check against the privacy policy of the service in order to rate.

Figure 4. Usage scenario of privacy measurement system.


b) Personal data attributes that are considered private; these attributes will be associated with the concept Private of the cross table. c) Weights α and β for the privacy measurement model. The users of the tool could be end users of the services or software designers who are assessing the privacy quality of the services to be aggregated in service-based applications. Additionally, service providers may use the tool for self-assessment; the measurement can be used for comparison with competing services and as a guideline for improving privacy protection. 3) The tool imports the WSDL document of the service. It is assumed that the service provider annotates the WSDL based on the personal information ontology. 4) The tool calculates the privacy score of the service and informs the user. As an example, a screenshot reporting the privacy measurements of the Register service is shown in Fig. 5.
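The calculation performed in step 4 can be sketched end to end with the numbers already obtained for the Register service in Tables I and III; this is our sketch of the computation, not the tool's source code.

# Sensitivity levels of the five data elements of the Register service (Table III).
sl = [0.71, 0.14, 1.86, 0.43, 0.56]
k = len(sl)
sl_ws  = sum(sl)                    # equation (8)  -> 3.7
nsl_ws = sl_ws / (4 * k)            # equation (9)  -> about 0.19
npcom  = 0.87                       # equation (2), from the ratings in Table I
alpha = beta = 0.5                  # equal weights for the two privacy aspects
P = alpha * npcom + beta * (1 - nsl_ws)   # equation (10)
print(round(P, 2))                  # about 0.84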

Figure 5. Example of measurements screen.

VII. CONCLUSION

This paper presents a privacy measurement model which combines and enhances existing privacy measurement approaches. The model considers both privacy principles compliance and the sensitivity level of personal information. The basis of the measurement is the privacy policy published by the service provider and the user's personal information that is exchanged with the service. The model can be applied even in the absence of such information. We also present a supporting tool which can automate privacy measurement based on the semantic annotation added to WSDL data elements. Generally a service user can consider the privacy score as one of the QoS scores to distinguish services with similar functionality. As discussed earlier, the privacy score is subjective to the users who assess the service. The score may vary depending on how the service provider provides a proof of privacy principles compliance, the expectation of the user when rating the compliance, and the user's personal view on private data. Also, the cross table presented in Table II is an example and is not intended to be exhaustive. A privacy measurement system can adjust the concepts, attributes, and their relations within the cross table as well as the corresponding personal information ontology. Since the measurement tool makes use of semantics-enhanced WSDLs, a limitation is that we require the service providers to specify semantics. However, semantic information only helps automate the calculation, and the measurement model itself does not rely on semantic annotation. The approach can still be followed and the measurement model can still be used even though WSDL documents are not semantics-annotated. At present, we target the privacy of single Web services. The approach can be extended to composite services. We are planning for an empirical evaluation of the model by service users and an experiment with real-world Web services as well as cloud services.

REFERENCES
[1] W. D. Yu, S. Doddapaneni, and S. Murthy, "A privacy assessment approach for service oriented architecture applications," in Procs. of 2nd IEEE Int. Symp. on Service-Oriented System Engineering (SOSE 2006), 2006, pp. 67-75.
[2] I. Jang and H. S. Yoo, "Personal information classification for privacy negotiation," in Procs. of 4th Int. Conf. on Computer Sciences and Convergence Information Technology (ICCIT 2009), 2009, pp. 1117-1122.
[3] W3C, Semantic Annotations for WSDL and XML Schema, http://www.w3.org/TR/2007/REC-sawsdl-20070828/, 28 August 2007.
[4] W3C, Web Services Architecture Requirements, http://www.w3.org/TR/wsa-reqs/, 11 February 2004.
[5] G. Yee, "Measuring privacy protection in Web services," in Procs. of IEEE Int. Conf. on Web Services, 2006, pp. 647-654.
[6] G. O. M. Yee, "An automatic privacy policy agreement checker for E-services," in Procs. of Int. Conf. on Availability, Reliability and Security, 2009, pp. 307-315.
[7] W. Xu, V. N. Venkatakrishnan, R. Sekar, and I. V. Ramakrishnan, "A framework for building privacy-conscious composite Web services," in Procs. of IEEE Int. Conf. on Web Services, 2006, pp. 655-662.
[8] M. Tavakolan, M. Zarreh, and M. A. Azgomi, "An extensible model for improving the privacy of Web services," in Procs. of Int. Conf. on Security Technology, 2008, pp. 175-179.
[9] T. Yu, Y. Zhang, and K. J. Lin, "Modeling and measuring privacy risks in QoS Web services," in Procs. of 8th IEEE Int. Conf. on E-Commerce Technology and 3rd IEEE Int. Conf. on Enterprise Computing, E-Commerce, and E-Services, 2006.
[10] R. Hewett and P. Kijsanayothin, "On securing privacy in composite web service transactions," in Procs. of 5th Int. Conf. for Internet Technology and Secured Transactions (ICITST'09), 2009, pp. 1-6.
[11] Uta Priss, Formal Concept Analysis, http://www.upriss.org.uk/fca/fca.html/, Last accessed: 24 February 2012.




Measuring Granularity of Web Services with Semantic Annotation


Nuttida Muchalintamolee and Twittie Senivongse
Computer Science Program, Department of Computer Engineering Faculty of Engineering, Chulalongkorn University Bangkok, Thailand nuttida.mu@student.chula.ac.th, twittie.s@chula.ac.th

Abstract - Web services technology has been one of the mainstream technologies for software development since Web services can be reused and composed into new applications or used to integrate software systems. Granularity or size of a service refers to the functional scope or the amount of detail associated with service design, and it has an impact on the ability to reuse or compose the service in different contexts. Designing a service with the right granularity is a challenging issue for service designers and mostly relies on designers' judgment. This paper presents a granularity measurement model for a Web service with semantics-annotated WSDL. The model supports different types of service design granularity, and semantic annotation helps with the analysis of the functional scope and amount of detail associated with the service. Based on granularity measurement, we then develop a measurement model for service reusability and composability. The measurements can assist in service design and the development of service-based applications.

Keywords- service granularity; measurement; reusability; composability; semantic Web services; ontology

I. INTRODUCTION

Web Services technology has been one of the mainstream technologies for software development since it enables rapid, flexible development and integration of software systems. The basic building blocks are Web services, which are software units providing certain functionalities over the Web and involving a set of interface and protocol standards, e.g. the Web Service Definition Language (WSDL) as a service contract, SOAP as a messaging protocol, and the Business Process Execution Language (WS-BPEL) as a flow-based language for service composition [1]. The technology promotes service reuse and service composition as the functionalities provided by a service should be reusable or composable in different contexts of use. Granularity of a service impacts its reusability and composability. Erl [1] defines granularity in the context of service design as the level of (or absence of) detail associated with service design. The service contract or service interface is the primary concern in service design since it represents what the service is designed to do and gives detail about its scope or size. Erl classifies four types of service design granularity: (1) Service granularity refers to the functional scope or the quantity of potential logic the service could encapsulate based on its


context. (2) Capability granularity refers to the functional scope of a specific capability (or operation). (3) Data granularity is the amount of data to be exchanged in order to carry out a capability. (4) Constraint granularity is the amount of validation constraints associated with the information exchanged by a capability. Different types of granularity impact service reusability and composability in different ways. Erl differentiates between these two terms: reusability is the ability to express agnostic logic and be positioned as a reusable enterprise resource, whereas composability is the ability to participate in multiple service compositions [1]. A coarse-grained service with a broad functional context should be reusable in different situations, while a fine-grained service capability can be composable in many service assemblies. Coarse-grained data exchanged by a capability could be a sign that the capability has a large scope of work and should be good for reuse, while a capability with very fine-grained (detailed) data validation constraints should be more difficult to reuse or compose in different contexts with different data formats. Inappropriate granularity design affects not only reusability and composability but also the performance of the service. Fine-grained capabilities, for example, may incur invocation overheads since many calls have to be made to perform a task [2]. Designing a service with the right granularity is a challenging issue for service designers and mostly relies on designers' judgment. To help determine service design granularity, we present a granularity measurement model for a Web service with semantics-annotated WSDL. The model supports all four types of granularity, and semantic annotation is based on the domain ontology of the service, which is expressed in OWL [3]. The motivation is that semantic annotation should give more information about the functional scope of the service and other detail which would help to determine granularity more precisely. Semantic concepts from the domain ontology can be annotated to different parts of a WSDL document using Semantic Annotation for WSDL and XML Schema (SAWSDL) [4]. Based on granularity measurement, we then develop a measurement model for service reusability and composability. Section II of the paper discusses related work. Section III introduces a Web service example which will be used throughout the paper. The granularity measurement model and


the reusability and composability measurement models are presented in Sections IV and V. Section VI gives an evaluation of the models and the paper concludes in Section VII. II. RELATED WORK

Several research efforts have addressed the importance of granularity to service-oriented systems. Haesen et al. [5] propose a classification of service granularity types which consists of data granularity, functionality granularity, and business value granularity. Their impact on architectural issues, e.g., reusability, performance, and flexibility, is discussed. In their approach, the term service refers more to an operation rather than a service with a collection of capabilities as defined by Erl. Feuerlicht [6] discusses that service reuse is difficult to achieve and uses composability as a measure of service reuse. He argues that the granularity of services and the compatibility of service interfaces are important to composability, and presents a process of decomposing coarse-grained services into fine-grained services (operations) with normalized interfaces to facilitate service composition. On granularity measurement, Shim et al. [7] propose a design quality model for SOA systems. The work is based on a layered model of design quality assessment. Mappings are defined between design metrics, which measure service artifacts, and design properties (e.g., coupling, cohesion, complexity), and between design properties and high-level quality attributes (e.g., effectiveness, understandability, reusability). Service granularity and parameter granularity are among the design properties. Service granularity considers the number of operations in the service system and the similarity between them (based on the similarity of their messages). Parameter granularity considers the ratio of the number of coarse-grained parameter operations to the number of operations in the system. Our approach is inspired by this work, but we focus only on granularity measurement for a single Web service, not on system-wide design quality, and will link granularity to reusability and composability attributes. We notice that their granularity measurement relies on the designer's judgment, e.g., to determine if an operation has fine-grained or coarse-grained parameters. We thus use semantic annotation to better understand the service. Another approach to granularity measurement is by Alahmari et al. [8]. They propose metrics for data granularity, functionality granularity, and service granularity. The approach considers not only the number of data and operations but also their types, which indicate whether the data and operations involve complicated logic. The impact on service operation complexity, cohesion, and coupling is discussed. Khoshkbarforoushha et al. [9] measure the reusability of BPEL composite services. The metric is based on analyzing description mismatch and logic mismatch between a BPEL service and requirements from different contexts of use.

III. EXAMPLE

with semantic descriptions. The figure shows the use of SAWSDL tags [4] to reference to the semantic concepts in a service domain ontology to which different parts of the WSDL correspond. Here the meaning of the data type named ProductInfo is the term ProductInfo in the domain ontology OnlineBooking in Fig. 2, and the meaning of the operation named viewProduct is the term SearchProductDetail. IV. GRANULARITY MEASUREMENT MODEL

Granularity measurement considers the schema and semantics of the WSDL description. Semantic granularity is determined first and then applied to different granularity types.
<?xml version="1.0" encoding="UTF-8"?> <wsdl:description targetNamespace="http://localhost:8101/GranularityMeasurement/ wsdl/OnlineBooking#" xmlns="http://localhost:8101/GranularityMeasurement/wsdl/ OnlineBooking#" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:wsdl="http://www.w3.org/ns/wsdl" xmlns:sawsdl="http://www.w3.org/ns/sawsdl"> <wsdl:types> <xs:schema targetNamespace="http://localhost:8101/ GranularityMeasurement/wsdl/OnlineBooking#" elementFormDefault="qualified"> <xs:element name="viewProductReq" type="productId"/> <xs:element name="viewProductRes" type="productInfo"/> <xs:simpleType name="productId"> <xs:restriction base="xs:string"> <xs:pattern value="[0-9]{4}"/> </xs:restriction> </xs:simpleType> <xs:complexType name="productInfo" sawsdl:modelReference="http://localhost:8101/Granularity Measurement/ontology/OnlineBooking#ProductInfo"> <xs:sequence> <xs:element name="productName" type="xs:string"/> <xs:element name="productType" type="productType"/> <xs:element name="description" type="xs:string"/> <xs:element name="unitPrice" type="xs:float"/> </xs:sequence> </xs:complexType> <xs:simpleType name="productType"> <xs:restriction base="xs:string"> <xs:pattern value="[A-Z]"/> </xs:restriction> </xs:simpleType> </xs:schema> </wsdl:types> <wsdl:interface name="OnlineBookingWSService" sawsdl:modelReference="http://localhost:8101/Granularity Measurement/ontology/OnlineBooking#OrderManagement"> <wsdl:operation name="viewProduct" pattern="http://www.w3.org/ns/wsdl/in-out" sawsdl:modelReference="http://localhost:8101/Granularity Measurement/ontology/OnlineBooking#SearchProductDetail"> <wsdl:input element="viewProductReq"/> <wsdl:output element="viewProductRes"/> </wsdl:operation> </wsdl:interface>
</wsdl:description>

An online booking Web service will be used to demonstrate our idea. It provides service for any product booking and includes several functions such as viewing product information and creating and managing booking. Fig. 1 shows the WSDL 2.0 document of the service. Suppose the WSDL is enhanced

Figure 1. WSDL of online booking Web service with SAWSDL annotation.


<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" <owl:Ontology /> <owl:ObjectProperty rdf:ID="part"/> <owl:Class rdf:ID="OrderManagement" /> <owl:Class rdf:ID="ProductInfo" /> <owl:Class rdf:ID="HotelInfo" > <rdfs:subClassOf rdf:resource="#ProductInfo" /> </owl:Class> <owl:Class rdf:ID="ProductName" > <rdfs:subClassOf rdf:resource="#Name" /> </owl:Class> <owl:FunctionalProperty rdf:ID="hasProductID"> <rdfs:subPropertyOf rdf:resource="#part"/> <rdfs:domain rdf:resource="#ProductInfo" /> <rdfs:range rdf:resource="#ID" /> <rdf:type rdf:resource="&owl;ObjectProperty" /> </owl:FunctionalProperty> <owl:FunctionalProperty rdf:ID="hasProductName"> <rdfs:subPropertyOf rdf:resource="#part"/> <rdfs:domain rdf:resource="#ProductInfo" /> <rdfs:range rdf:resource="#ProductName" /> <rdf:type rdf:resource="&owl;ObjectProperty" /> </owl:FunctionalProperty> <owl:FunctionalProperty rdf:ID="hasProductPrice"> <rdfs:subPropertyOf rdf:resource="#part"/> <rdfs:domain rdf:resource="#ProductInfo" /> <rdfs:range rdf:resource="#Price" /> <rdf:type rdf:resource="&owl;ObjectProperty" /> </owl:FunctionalProperty> <owl:FunctionalProperty rdf:ID="hasProductType"> <rdfs:subPropertyOf rdf:resource="#part"/> <rdfs:domain rdf:resource="#ProductInfo" /> <rdfs:range rdf:resource="#Type" /> <rdf:type rdf:resource="&owl;ObjectProperty" /> </owl:FunctionalProperty> <owl:Class rdf:ID="SearchProductDetail" /> <owl:Class rdf:ID="SearchProductInfo" > <rdfs:subClassOf rdf:resource="#SearchProductDetail" /> </owl:Class> <owl:Class rdf:ID="SearchRelatedProductInfo" > <rdfs:subClassOf rdf:resource="#SearchProductDetail" /> </owl:Class> <owl:Class rdf:ID="GetProductUpdate" /> <owl:Class rdf:ID="GetProductPriceUpdate" /> <owl:FunctionalProperty rdf:ID="hasGetProductUpdate"> <rdfs:subPropertyOf rdf:resource="#part"/> <rdfs:domain rdf:resource="#SearchProductDetail" /> <rdfs:range rdf:resource="#GetProductUpdate" /> <rdf:type rdf:resource="&owl;ObjectProperty" /> </owl:FunctionalProperty> <owl:FunctionalProperty rdf:ID="hasGetProductPriceUpdate"> <rdfs:subPropertyOf rdf:resource="#part"/> <rdfs:domain rdf:resource="#SearchProductDetail" /> <rdfs:range rdf:resource="#GetProductPriceUpdate" /> <rdf:type rdf:resource="&owl;ObjectProperty" /> </owl:FunctionalProperty> </rdf:RDF>

Figure 3. Semantic granularity of ProductInfo and related terms.

A. Semantic Granularity

When a part of a WSDL document is annotated with a semantic term, we determine the functional scope and the amount of detail associated with that WSDL part through the semantic information that can be derived from the annotation. Class-subclass and whole-part property relations are the semantic relations considered. Class-subclass is a built-in relation in OWL, but whole-part is not; we therefore define an ObjectProperty part (see Fig. 2) to represent the whole-part relation, and any whole-part relation between classes is defined as a subPropertyOf part. The semantic granularity of a term t that participates in class-subclass/whole-part relations is computed by (1):

Semantic Granularity(t) = number of terms under t in either a class-subclass relation or a whole-part relation, including t itself    (1)

Using (1), Fig. 3 shows the semantic granularity of the semantic term ProductInfo and its related terms with respect to the class-subclass and whole-part property relations. When an ontology term is annotated to a WSDL part, it transfers its semantic granularity to that WSDL part.

Figure 2. A part of domain ontology for online booking (in OWL).

B. Constraint Granularity

A service capability (or operation) needs to operate on correct input and output data, so constraints are put on the exchanged data for validation purposes. Constraint granularity counts the number of control attributes and restrictions (other than defaults) assigned to the schema of the WSDL data, e.g., attributes of <xs:element/> such as fixed, nillable, maxOccurs, and minOccurs, and <xs:restriction/>, which constrains the element content. The constraint granularity R_o of a capability o is computed by (2):

R_o = \sum_{i=1}^{n} \sum_{j=1}^{m_i} \mathrm{Constraint}_{ij}    (2)

where
n = the number of parameters of the operation o,
m_i = the number of elements/attributes of the i-th parameter,
Constraint_ij = the number of constraints on the j-th element/attribute of the i-th parameter.

In Fig. 1, the operation viewProduct has constraints on two out of five input/output data elements, namely productId and productType, so its constraint granularity is 2.
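To make the constraint count in (2) concrete, the following is a minimal, hypothetical Python sketch (not the authors' tool). The dictionary below paraphrases the facets visible in Fig. 1 rather than parsing the WSDL itself; the structure and names are assumptions introduced here for illustration.

```python
# Hypothetical sketch of equation (2): count constraining facets per element of each parameter.
# The schema description restates Fig. 1; it is not derived by parsing the actual WSDL.

view_product_schema = {
    "viewProductReq": {"productId": ["pattern [0-9]{4}"]},       # one restriction
    "viewProductRes": {"productName": [], "productType": ["pattern [A-Z]"],
                       "description": [], "unitPrice": []},      # one restriction
}

def constraint_granularity(schema):
    """R_o per (2): sum of constraints over all elements of all parameters."""
    return sum(len(cons) for elements in schema.values() for cons in elements.values())

print(constraint_granularity(view_product_schema))   # 2, as in the running example
```

Running this reproduces the value 2 derived above for viewProduct.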


C. Data Granularity

A WSDL document normally describes the detail of the data elements exchanged by a service capability using the XML schema in its <types> tag. With semantic annotation to a data element, semantic detail is additionally described. If the semantic term is defined in a class-subclass relation (i.e., it has subclasses), the term transfers its generalization, encapsulating several specialized concepts, to the data element that it annotates. If the semantic term is defined in a whole-part relation (i.e., it has parts), it transfers its whole concept, encapsulating the different parts, to the data element that it annotates.

For a data element with no sub-elements (i.e., a lowest-level element), we determine its granularity DG_LE from its class-subclass and whole-part relations. For the whole-part relation, if the element has associated whole-part semantics, we determine the parts from the semantic term; otherwise the part count is 1, denoting the lowest-level element itself (see (3)). For a data element with sub-elements, we compute its granularity DG_E as the sum of the data granularity of all its immediate sub-elements DG_SE together with the semantic granularity of the element itself (see (4)); note that (4) is recursive. Finally, the data granularity D_o of a capability o is the sum of the data granularity of all its parameter elements (see (5)).

DG_{LE} = ac_p + \max(1, ap_p)    (3)

DG_E = \sum_{j=1}^{m} DG_{SE_j} + ac_p + ap_p    (4)

D_o = \sum_{i=1}^{n} DG_{E_i}    (5)

where
n = the number of parameters of the operation o,
DG_E = data granularity of an element with sub-elements/attributes,
m = the number of sub-elements/attributes of an element,
DG_SE = data granularity of an immediate sub-element/attribute of an element,
DG_LE = data granularity of a lowest-level element/attribute,
ac_p = semantic granularity in the class-subclass relation of an element/attribute, computed by (1),
ap_p = semantic granularity in the whole-part property relation of an element/attribute, computed by (1).

In Fig. 1, the input viewProductReq of the operation viewProduct has no sub-elements or semantic annotation, so its granularity as a DG_LE is 1 (0 + max(1, 0)). In contrast, the output viewProductRes is of type productInfo, which is also annotated with the ontology term ProductInfo. From the schema in Fig. 1, this output has four sub-elements (productName, productType, description, unitPrice). Each sub-element has no further sub-elements or semantic annotation, so its granularity as a DG_LE is 1 as well. In Fig. 3, the semantic term ProductInfo has three direct and three indirect subclasses as well as four parts. The granularity of the output viewProductRes as a DG_E is therefore 16 (i.e., (1+1+1+1) + 7 + 5), and the data granularity D_o of the operation viewProduct is 17 (1 + 16).

D. Capability Granularity

The functional scope of a service capability can be derived from data granularity and semantic annotation. If large data are exchanged by the capability, it can be inferred that the capability involves a big task in processing those data. We can additionally infer that the capability is broad in scope if its semantics involves other specialized functions (i.e., it has a class-subclass relation) or other sub-tasks (i.e., it has a whole-part relation). The capability granularity C_o of a capability o is computed by (6):

C_o = D_o + ac_o + ap_o    (6)

where
D_o = data granularity of the operation o,
ac_o = semantic granularity in the class-subclass relation of the operation o, computed by (1),
ap_o = semantic granularity in the whole-part property relation of the operation o, computed by (1).

From the previous calculation, the data granularity of the operation viewProduct in Fig. 1 is 17. This operation is annotated with the semantic term SearchProductDetail. In Fig. 2, this semantic term is a generalization of two concepts, SearchProductInfo and SearchRelatedProductInfo, so the capability viewProduct encapsulates these two specialized tasks. The semantic term SearchProductDetail also comprises two sub-tasks, GetProductUpdate and GetProductPriceUpdate, in a whole-part relation. The capability granularity of viewProduct is therefore 23 (17 + 3 + 3).

E. Service Granularity

The functional scope of a service is determined by all of its capabilities together with semantic annotation, which describes the scope of use of the service semantically. The service granularity S_w of a service w is computed by (7):

S_w = \sum_{i=1}^{k} C_{o_i} + ac_w + ap_w    (7)

where
k = the number of operations of the service w,
C_o = capability granularity of an operation o,
ac_w = semantic granularity in the class-subclass relation of the service w, computed by (1),
ap_w = semantic granularity in the whole-part property relation of the service w, computed by (1).
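Before the service-level example that follows, the recursive computation in (3)–(6) can be illustrated with a minimal, hypothetical Python sketch (not the authors' tool). The dictionary-based ontology and element structures below are simplifications introduced here; only the counts follow Figs. 1–3.

```python
# Illustrative sketch of (3)-(6) for the viewProduct example; names and data structures
# are hypothetical, the ontology counts restate the paper's running example.

ONTOLOGY = {
    # term: descendants via class-subclass and via the whole-part "part" property
    "ProductInfo":         {"subclasses": 6, "parts": 4},   # 3 direct + 3 indirect subclasses, 4 parts
    "SearchProductDetail": {"subclasses": 2, "parts": 2},
}

def sem_granularity(term, relation):
    """Semantic granularity per (1): terms under `term` in the relation, plus the term itself."""
    if term is None:
        return 0
    return ONTOLOGY[term][relation] + 1

def data_granularity(element):
    """Data granularity of one element per (3)/(4); `element` may carry an 'annotation'
    (ontology term) and a list of 'children' (sub-elements)."""
    ac = sem_granularity(element.get("annotation"), "subclasses")
    ap = sem_granularity(element.get("annotation"), "parts")
    children = element.get("children", [])
    if not children:                                        # lowest-level element, (3)
        return ac + max(1, ap)
    return sum(data_granularity(c) for c in children) + ac + ap   # (4), recursive

def capability_granularity(operation):
    """Capability granularity per (5) and (6)."""
    d_o = sum(data_granularity(p) for p in operation["parameters"])        # (5)
    ann = operation.get("annotation")
    c_o = d_o + sem_granularity(ann, "subclasses") + sem_granularity(ann, "parts")   # (6)
    return d_o, c_o

view_product = {
    "annotation": "SearchProductDetail",
    "parameters": [
        {"name": "viewProductReq"},                         # no sub-elements, no annotation -> 1
        {"name": "viewProductRes", "annotation": "ProductInfo",
         "children": [{"name": n} for n in
                      ("productName", "productType", "description", "unitPrice")]},
    ],
}

print(capability_granularity(view_product))   # (17, 23), matching the worked example
```

Running the sketch reproduces D_o = 17 and C_o = 23 as derived above.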


In Fig. 1, the online booking service is associated with the semantic term OrderManagement. Suppose the term OrderManagement has no subclasses but comprises eight concepts (i.e., parts) in a whole-part property relation. Its service granularity is then the summation of the capability granularity of the operation viewProduct (i.e., 23), the capability granularity of all other operations, and the semantic granularity in the class-subclass and whole-part property relations (i.e., 1+9).

It is seen from the granularity measurement model that semantic annotation helps complement granularity measurement. For the case of the operation viewProduct, for example, the granularity of its capability can only be inferred from the granularity of its data if the operation has no semantic annotation. However, by annotating this operation with the generalized term SearchProductDetail, we gain knowledge about its broad scope, namely that its capability encapsulates both the specialized SearchProductInfo and SearchRelatedProductInfo tasks. The additional information refines the measurement.

V. REUSABILITY AND COMPOSABILITY MEASUREMENT MODELS

As mentioned in Section I, reusability is the ability to express agnostic logic and be positioned as a reusable enterprise resource, whereas composability is the ability to participate in multiple service compositions. Reusability is concerned with putting a service as a whole to use in different contexts. Composability is a mechanism for reuse, but it focuses on the assembly of functions, i.e., it touches reuse at the operation level rather than the service level. We follow the method in [7] to first identify the impact that granularity has on the reusability and composability attributes and then derive measurement models for them.

Table I presents the impact of granularity. For reusability, a coarse-grained service with a broad functional context providing several functionalities should be reused well, as it can do many tasks serving many purposes. Coarse-grained data exchanged by an operation can be a sign that the operation has a large scope of work and should be good for reuse as well. So we define a positive impact on reusability for coarse-grained data, capabilities, and services. For composability, we focus at the service operation level, and service granularity is not considered. A small operation doing a small task and exchanging small data should be easier to include in a composition, since it does not do too much work or exchange excessive data that different contexts of use may not require or be able to provide. So we define a negative impact on composability for coarse-grained capabilities and data. For constraints on data elements, a bigger number of constraints means finer-grained restrictions are put on the data; they make the data more specific and may not be easy to reuse, hence a negative impact on both attributes.

TABLE I. IMPACT OF GRANULARITY ON REUSE

  Granularity Type         Reusability    Composability
  Service Granularity      positive       not considered
  Capability Granularity   positive       negative
  Data Granularity         positive       negative
  Constraint Granularity   negative       negative

A. Reusability Model

Reusability measurement is derived from the impact of granularity. The different types of granularity measurement relate to each other: service granularity is built on capability granularity, which in turn is built on data granularity, and they all have a positive impact. So we consider only service granularity in the model, since the effects of data granularity and capability granularity are already part of service granularity. The negative impact of constraint granularity is incorporated in the model (8):

\mathrm{Reusability} = S_w - \sum_{i=1}^{k} R_{o_i}    (8)

where
S_w = service granularity of the service w,
R_o = constraint granularity of the operation o,
k = the number of operations of the service w.

A coarse-grained service with few data constraints has high reusability.

B. Composability Model

In a similar manner, we consider only capability granularity and constraint granularity in the composability model, because the effects of data granularity are already part of capability granularity. Since they all have a negative impact, we represent the composability measure with the opposite meaning: we define the term uncomposability to represent the inability of a service operation to be composed in a service assembly (9):

\mathrm{Uncomposability} = C_o + R_o    (9)

where
C_o = capability granularity of the operation o,
R_o = constraint granularity of the operation o.

A fine-grained capability with few data constraints has low uncomposability, i.e., high composability.

VI. EVALUATION

We apply the measurement models to two Web services. The first is the online booking Web service that we have used to demonstrate the idea. It is a general service comprising a fairly large number of small data elements and operations; its scope covers viewing, managing, and booking products. The other is an online order Web service which has only booking-related functions. The two Web services are annotated with semantic terms from the online booking ontology, which describes the processes and data of the online booking domain. Table II shows details of some operations of the two services, including their capabilities, data, and semantic annotation. For the evaluation, a granularity measurement tool was developed to automatically measure the granularity of Web services. It is implemented in Java with Jena [10], which helps with ontology processing and the inference of relations.


Table III presents the granularity measurements and reusability scores. The online booking service is coarser and has higher reusability: it is a bigger service with a wider range of functions, exchanging more data and having a number of data constraints. It is likely that the online booking service can be put to use in various contexts. The online order service, on the other hand, is finer-grained and focuses on order management. The two services are annotated with semantic terms from the same ontology, and the additional semantic detail helps refine their measurements.

Table IV presents the granularity measurements and uncomposability of the operations annotated with the semantic term UpdateOrder. The operation editOrderItem of the online order service has coarser data and capability than the three finer-grained operations of the online booking service, and it is therefore less composable.

VII. CONCLUSION

This paper explores the application of semantics-annotated WSDL to measuring the design granularity of Web services. Four types of granularity are considered together with semantic granularity. Models for reusability and composability (represented by uncomposability) are also introduced. As explained in the example, semantic annotation can help us derive the functional contexts and concepts that a service, capability, and data element encapsulate. Granularity measurement, which is traditionally done by analyzing the size of capabilities and data described in standard WSDL and XML schema documents, can thus be refined and better automated.

TABLE III. GRANULARITY AND REUSABILITY

  Service Name               Ro    Do    Co    Sw    Reusability (Sw - Ro)
  OnlineBookingWSService     48    143   184   194   146
  OnlineOrderWSService       10    47    62    72    62

TABLE IV. SERVICE GRANULARITY AND UNCOMPOSABILITY OF OPERATIONS ANNOTATED WITH UPDATEORDER

  Service Name             Operation Name               Ro    Do    Co    Uncomposability (Co + Ro)
  OnlineBookingWSService   addProductToCart             4     15    18    22
                           deleteProductFromCart        3     14    17    20
                           editProductQuantityInCart    4     15    18    22
  OnlineOrderWSService     editOrderItem                3     19    22    25
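The scores in Tables III and IV follow directly from (8) and (9); the small sketch below simply restates the published numbers and recomputes them as a check (the constraint value 3 for editOrderItem is inferred from Co + Ro = 25, since it is not printed separately in the extracted table).

```python
# Illustrative check of (8) and (9) using the values reported in Tables III and IV.

services = {  # service -> service granularity S_w and total constraint granularity of its operations
    "OnlineBookingWSService": {"Sw": 194, "sum_Ro": 48},
    "OnlineOrderWSService":   {"Sw": 72,  "sum_Ro": 10},
}
for name, v in services.items():
    print(name, "reusability =", v["Sw"] - v["sum_Ro"])      # (8): 146 and 62

operations = {  # operation -> (C_o, R_o), from Table IV
    "addProductToCart":          (18, 4),
    "deleteProductFromCart":     (17, 3),
    "editProductQuantityInCart": (18, 4),
    "editOrderItem":             (22, 3),   # R_o inferred from the uncomposability score
}
for name, (c_o, r_o) in operations.items():
    print(name, "uncomposability =", c_o + r_o)              # (9): 22, 20, 22, 25
```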

For future work, we aim to refine the domain ontology and the WSDL annotation. It would be interesting to see the effect of annotation on granularity, reusability, and composability when the WSDL contains many annotations compared to when it is sparsely annotated. Since annotation can be made to different parts of a WSDL document, the location of the annotations can also affect the granularity scores. Additionally, we will try the models with Web services in business organizations and extend them to apply to composite services.

TABLE II. PART OF SERVICE DETAIL AND SEMANTIC ANNOTATION

  Operation                                       Input Data                      Output Data
  Name                        Annotation          Name                 Type       Name               Type
  Online booking web service
  newCart                     InsertOrder         userId               ID         orderId            ID
  addProductToCart            UpdateOrder         addProduct           OrderItem  processResult      Status
  deleteProductFromCart       UpdateOrder         deleteProduct        OrderItem  processResult      Status
  editProductQuantityInCart   UpdateOrder         editProductQuantity  OrderItem  processResult      Status
  viewOrderItemInCart         SearchProductByOrderID  orderId          ID         orderItemList      OrderItem
  reservation                 EditOrder           reservedOrder        Order      processResult      Status
  Online order web service
  createOrder                 CreateOrder         orderRequest         Order      orderResponse      Status
  editOrderItem               UpdateOrder         orderItemInfo        Order      orderItemResponse  Status
  submitOrder                 EditOrder           orderId              ID         orderResponse      Status

REFERENCES

[1] T. Erl, SOA: Principles of Service Design, Prentice Hall, 2007.
[2] T. Senivongse, N. Phacharintanakul, C. Ngamnitiporn, and M. Tangtrongchit, "A capability granularity analysis on Web service invocations," in Procs. of World Congress on Engineering and Computer Science 2010 (WCECS 2010), 2010, pp. 400-405.
[3] W3C (2004, February 10), OWL Web Ontology Language Overview [Online]. Available: http://www.w3.org/TR/2004/REC-owl-features-20040210/
[4] W3C (2007, August 28), Semantic Annotations for WSDL and XML Schema [Online]. Available: http://www.w3.org/TR/2007/REC-sawsdl-20070828/
[5] R. Haesen, M. Snoeck, W. Lemahieu, and S. Poelmans, "On the definition of service granularity and its architectural impact," in Procs. of 20th Int. Conf. on Advanced Information Systems Engineering (CAiSE 2008), LNCS 5074, 2008, pp. 375-389.
[6] G. Feuerlicht, "Design of composable services," in Procs. of 6th Int. Conf. on Service Oriented Computing (ICSOC 2008), LNCS 5472, 2008, pp. 15-27.
[7] B. Shim, S. Choue, S. Kim, and S. Park, "A design quality model for service-oriented architecture," in Procs. of 15th Asia-Pacific Software Engineering Conference (APSEC 2008), 2008, pp. 403-410.
[8] S. Alahmari, E. Zaluska, and D. C. De Roure, "A metrics framework for evaluating SOA service granularity," in Procs. of IEEE Int. Conf. on Services Computing (SCC 2011), 2011, pp. 512-519.
[9] A. Khoshkbarforoushha, P. Jamshidi, and F. Shams, "A metric for composite service reusability analysis," in Procs. of the 2010 ICSE Workshop on Emerging Trends in Software Metrics (WETSoM 2010), 2010, pp. 67-74.
[10] Apache Jena [Online]. Available: http://incubator.apache.org/jena/, last accessed: January 30, 2012.


Decomposing ontology in Description Logics by graph partitioning


Thi Anh Le PHAM
Faculty of Information Technology Hanoi National University of Education Hanoi, Vietnam lepta@hnue.edu.vn

Nhan LE-THANH
Laboratory I3S Nice Sophia-Antipolis University Nice, France nhan.le-thanh@unice.fr

Minh Quang NGUYEN Faculty of Information Technology Hanoi National University of Education Hanoi, Vietnam quangnm@hnue.edu.vn

Abstract—In this paper, we investigate the problem of decomposing an ontology in Description Logics (DLs) using graph partitioning algorithms, focusing on the syntactic features of the axioms of the given ontology. Our approach aims at decomposing the ontology into sub-ontologies that are as distinct as possible. We analyze the algorithms and identify the partitioning parameters that influence the efficiency of computation and reasoning: the number of concepts and roles shared by a pair of sub-ontologies, the size (number of axioms) of each sub-ontology, and the topology of the decomposition. We provide two concrete approaches for automatically decomposing an ontology: one based on minimal separators, the other based on eigenvector/eigenvalue (spectral) segmentation. We have also tested the approaches on parts of TBoxes used in the FaCT system (e.g., Vedaall, tambis) and report preliminary results.

Keywords—graph partitioning; ontology decomposition; image segmentation

I. INTRODUCTION

Our computational analysis of reasoning algorithms suggests the following decomposition parameters: the number of concepts and roles included in the semantic mappings between partitions, the size of each component ontology (the number of axioms in each component), and the topology of the decomposition graph. There are two decomposition approaches, based on two ways of representing the ontology: one represents the ontology as a symbol graph and performs the decomposition via minimal separators; the other uses an axiom graph, corresponding to the image segmentation method. The rest of the paper is organized as follows. Section 2 defines the graph-based G-decomposition methodology, summarizes its principal steps, and recalls the criteria for a good decomposition. Sections 3 and 4 describe two ways of transforming an ontology into an undirected graph (a symbol graph or a weighted axiom graph) together with the corresponding partitioning algorithms. Section 5 presents an evaluation of the decomposition algorithms and experimental results. Finally, we provide conclusions and future work in Section 6.

Previous studies on DL-based ontologies focus on tasks such as ontology design, ontology integration, and ontology deployment. Starting from the need to reason effectively with a large ontology, we examine ontology decomposition rather than ontology integration. There have been some investigations into the decomposition of DL ontologies, such as decomposition-based module extraction [3] or decomposition based on the syntactic structure of the ontology [1]. Our previous paper [8] assumed the existence of an ontology (TBox) decomposition, called overlap decomposition, that preserves the semantics and the inference results of the original TBox. Our aim here is to establish theoretical foundations for decomposition methods that improve the efficiency of reasoning and guarantee the properties proposed in [7]. The automatic decomposition of a given ontology is an optimal step in ontology design and is supported by graph theory, which provides properties that fit the requirements of our decomposition.

II. G-DECOMPOSITION METHODOLOGY

In this paper, ontology decomposition is considered only at the terminological level (TBox). We study methods that decompose a given TBox into several sub-TBoxes. For simplicity, a TBox is represented by its set of axioms A, and we represent this set of axioms as a graph. Our goal is to eliminate general concept inclusions (GCIs), a general type of axiom, as much as possible from a general ontology (presented as a TBox) by decomposing its set of GCIs into several subsets of GCIs (presented as a distributed TBox). We consider only the syntactic approach based on the structure of the GCIs. We recall the criteria of a good decomposition [8]:

- All the concepts, roles, and axioms of the original ontology are kept through the decomposition.


- The numbers of axioms in the sub-TBoxes are balanced (roughly equal). A simple check of these criteria is sketched below.
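As a quick illustration of the two criteria, the following hypothetical Python helper (our own, not part of the paper) verifies that a candidate decomposition loses no axiom and keeps the sub-TBoxes roughly equal in size; the toy axiom names are assumptions.

```python
# Hypothetical helper for the decomposition criteria: nothing lost, sizes balanced.
def check_decomposition(tbox_axioms, sub_tboxes, tolerance=1):
    """tbox_axioms: dict axiom-name -> set of symbols; sub_tboxes: list of sets of axiom names."""
    covered = set().union(*sub_tboxes)
    preserved = covered == set(tbox_axioms)                 # every axiom kept somewhere
    sizes = [len(s) for s in sub_tboxes]
    balanced = max(sizes) - min(sizes) <= tolerance         # axiom counts roughly equal
    return preserved, balanced

axioms = {"A1": {"C1", "X"}, "A2": {"C2", "X"}, "A3": {"C3", "Y"}, "A4": {"C4", "Y"}}
print(check_decomposition(axioms, [{"A1", "A2"}, {"A3", "A4"}]))   # (True, True)
```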

As a result, we propose two graph-based decomposition techniques. G-decomposition is an ontology decomposition method that applies graph partitioning techniques. The graph decomposition is represented by an intersection graph (decomposition graph) in which each vertex is a sub-graph and each edge represents the connection between a pair of vertices. In [8] we defined an overlap decomposition of a TBox; it is represented by a distributed (decomposed) TBox that consists of a set of sub-TBoxes and a set of links between these sub-TBoxes (semantic mappings). We assume that readers are familiar with the basic concepts of graph theory. We therefore propose an ontology decomposition method as a process with three principal phases (illustrated in Table 1): transform a TBox into a graph, decompose the graph into sub-graphs, and transform these sub-graphs into a distributed TBox. Table 1 presents the general algorithm of G-decomposition.
TABLE 1. A GENERAL ALGORITHM OF G-DECOMPOSITION

Definition 1 (symbol graph): A graph G = (V, E), where V is a set of vertices and E is a set of edges, is called a symbol graph of T(A) if each vertex v ∈ V is a symbol of Ex(A) and there is an edge e = (u, v) ∈ E if u and v occur in the same axiom of A. So, given a set of axioms A, we can build a symbol graph G = (V, E) by taking each symbol in Ex(A) as a vertex and connecting two vertices by an edge if their symbols occur in the same axiom of A. Following this method, each axiom is represented as a clique in the symbol graph. Example 1: Given the TBox in Fig. 1:

PROCEDURE DECOMP-TBOX (T = (C, R, A))
  T = (C, R, A) is a TBox, with the set of concepts C, the set of roles R, and the set of axioms A.
  (1) TRANS-TBOX-GRAPH (T = (C, R, A))
      Build a graph G = (V, E) of this TBox, where each vertex v ∈ V is a concept in C or a role in R (or an axiom in A), and there is an edge e = (u, v) ∈ E if u and v appear in the same axiom (or if u and v have at least one concept (role) in common).
  (2) DECOMP-GRAPH (G = (V, E))
      Decompose the graph G = (V, E) obtained by TRANS-TBOX-GRAPH into an intersection graph G0 = (V0, E0), where each vertex v ∈ V0 is a sub-graph and there is an edge e = (u, v) ∈ E0 if u and v are linked.
  (3) TRANS-GRAPH-TBOX (G0 = (V0, E0))
      Transform the graph G0 = (V0, E0) into a distributed TBox: each vertex (sub-graph) corresponds to a sub-TBox, and the edges of E0 correspond to semantic mappings.

In the next sections, we introduce the detailed techniques for steps (1) and (2).

III. DECOMPOSITION BASED ON MINIMAL SEPARATOR
Figure 1. TBox Tfam

The set of primitive concepts and roles of Tfam is Ex(Tfam) = {C1, C2, C3, C4, C5, C6, X, Y, T, H}. Fig. 2 presents the symbol graph of Tfam.

Figure 2. Symbol graph presenting TBox Tfam
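The clique-per-axiom construction of Definition 1 is easy to prototype. The sketch below is a hypothetical illustration (not the authors' implementation) using networkx; the toy axioms are made up and are not the axioms of Tfam, which Fig. 1 defines but which are not reproduced here.

```python
# Hypothetical sketch of Definition 1: each axiom contributes a clique over its symbols.
import itertools
import networkx as nx

# each axiom is represented simply by the set of symbols (concepts/roles) it uses
axioms = {
    "A1": {"C1", "C2", "X"},
    "A2": {"C2", "C3", "X"},
    "A3": {"C4", "X", "Y"},
    "A4": {"C5", "Y"},
    "A5": {"C6", "Y", "T", "H"},
}

def symbol_graph(axioms):
    G = nx.Graph()
    for symbols in axioms.values():
        G.add_nodes_from(symbols)
        # connect every pair of symbols occurring in the same axiom -> one clique per axiom
        G.add_edges_from(itertools.combinations(symbols, 2))
    return G

G = symbol_graph(axioms)
print(sorted(G.nodes()), G.number_of_edges())
```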

The result of this decomposition is represented by a labeled graph (intersection graph or decomposition graph) Gp = (Vp, Ep). Assume that the graph G representing a TBox T is divided into n sub-graphs Gi, i ≤ n; then a decomposition graph is defined as follows:

Definition 2 (decomposition graph) [4]: A decomposition graph is a labeled graph Gp = (Vp, Ep) in which each vertex v ∈ Vp is a partition (sub-graph) Gi, and each edge eij = (vi, vj) ∈ Ep is labeled by the set of symbols shared by Gi and Gj, where i ≠ j and i, j ≤ n.

Definition 3 ((a, b)-minimal vertex separator) [3]: A set of vertices S is called an (a, b)-vertex separator if {a, b} ⊆ V\S and every path connecting a and b in G passes through at least one vertex of S. If S is an (a, b)-vertex separator and does not contain another (a, b)-vertex separator, then S is an (a, b)-minimal vertex separator.

A. Symbol graph

Given a set of axioms A of a TBox T, Ex(A) denotes the set of concepts and roles that appear in the expressions of A. For simplicity, we use the term symbol for a concept (role) of the TBox. A graph representing a TBox is defined as follows:


B. Algorithm

We present a recursive algorithm that uses Even's algorithm [4] to find the sets of vertices that divide a graph into partitions. It takes a symbol graph G = (V, E) (representing a TBox) as input and returns a decomposition graph and the set of separated sub-graphs. The idea of the algorithm is to look for a connecting part of the graph (a cut), compute a minimal separator of the graph, and then carve the graph along this separator. Initially, the TBox T is considered as one large partition, and it is cut into two parts in each recursive iteration. The steps of the algorithm are summarized in Tables 2 and 3.

TABLE 2. AN ALGORITHM FOR TRANSFORMING THE TBOX INTO A GRAPH

Input: TBox T(A); M, a limit on the number of symbols in a part (a sub-TBox Ti).
Output: Gp = (Vp, Ep) and {Ti}.
PROCEDURE DIVISION-TBOX (A, M)
  (1) Transform A into a symbol graph G(V, E) with V = Ex(A) and E = {(l1, l2) | A ∈ A, l1, l2 ∈ Ex(A)}.  /* A is an axiom in A */
  (2) Let Gp = (Vp, Ep) be an undirected graph with Vp = {{V}} and Ep = ∅.
  (3) Call DIVISION-GRAPH(G, M, nil, nil).
  (4) For each v ∈ Vp, let Tv = {A ∈ A | Ex(A) ⊆ v}. Return Tv, v ∈ Vp, and Gp.

The procedure DIVISION-GRAPH takes as input a symbol graph G = (V, E) of T, a limit parameter M, and two vertices a, b that are initially set to nil. It updates the global variable Gp representing the decomposition process. In each recursive call, it finds a minimal separator of vertices a, b in G. If one of a, b is nil, or both are nil, it finds the global minimal separator between all vertices and the non-nil vertex (or between all vertices). This separator cuts the graph G into two parts G1, G2, and the process continues recursively on these parts.

TABLE 3. AN ALGORITHM FOR PARTITIONING THE SYMBOL GRAPH

Input: G = (V, E).
Output: connection graph Gp = (Vp, Ep).
PROCEDURE DIVISION-GRAPH (G, M, a, b)
  (1) Find the set of minimal vertex separators of G:
      - select a pair of non-adjacent arbitrary vertices (a, b) and compute the set of (a, b)-minimal separators;
      - repeat this process for every pair of non-adjacent vertices x, y.
  (2) Find the global minimal vertex separator S* between all vertices of G.
  (3) Decompose G by S* into two sub-graphs G1, G2, where S* is included in both G1 and G2.
  (4) Generate an undirected graph Gp = (Vp, Ep), where Vp = {G1, G2} and Ep = S*.

A method that lists all the (a, b)-minimal vertex separators for a pair of non-adjacent vertices using a best-first search technique can be found in [6]. Tfam (Fig. 1) can be represented by the undirected adjacency graph in Fig. 2, where the vertices correspond to the symbols and an edge connects two vertices whose symbols occur in the same axiom; each axiom is therefore represented as a clique.

Figure 3. Decomposition result of the symbol graph of Tfam with minimal separators S* = {X} and S*' = {Y}.

If the criterion is the balance of the number of TBox axioms between components, then S* = {X} and S*' = {Y}. Using S* and S*' to decompose the symbol graph, we obtain three symbol groups {C1, C2, C3, X}, {C4, C5, X, Y}, and {C6, H, Y, T}, and hence three corresponding sub-TBoxes: T1 = {A1, A2, A7, A8}, T2 = {A3, A4, A9, A10}, and T3 = {A5, A6}. The sizes of S* and S*' are 1 (|S*| = |{X}| = 1, |S*'| = |{Y}| = 1). The cardinalities of the three sub-TBoxes are N1 = 4, N2 = 4, and N3 = 2. In this case, the number of symbols in each sub-TBox is also roughly equal.

The symbol graph of Tfam after decomposition is shown in Fig. 3. The resulting TBoxes T1, T2, and T3 preserve all the concepts, roles, and axioms of the original Tfam. In addition, T1 and T2 satisfy the proposed decomposition criteria. We have executed the graph partitioning algorithm based on minimal separators; it returns a result that satisfies the given properties. All concepts, roles, and axioms are preserved through the decomposition, and the relations between them are represented by the edges of the symbol intersection graph. This method minimizes the symbols shared between component TBoxes, ensuring the independence of the sub-TBoxes. However, to obtain the resulting sub-TBoxes, the obtained sub-graphs must be transferred back into sets of axioms for the corresponding TBoxes.
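The sketch below is a simplified, hypothetical stand-in for DIVISION-GRAPH (not the authors' Even-algorithm-based implementation): it uses networkx's minimum_node_cut between one non-adjacent pair as the separator and keeps the separator in both parts, then assigns each toy axiom to every part that contains all of its symbols. The axioms and symbol names are assumptions for illustration only.

```python
# Simplified stand-in for one DIVISION-GRAPH step: cut on a minimum (a, b) vertex separator.
import itertools
import networkx as nx

# toy axioms (hypothetical, not Tfam): symbol sets per axiom
axioms = {
    "A1": {"C1", "C2", "X"},
    "A2": {"C2", "C3", "X"},
    "A3": {"C4", "X", "Y"},
    "A4": {"C5", "Y"},
    "A5": {"C6", "Y", "T", "H"},
}
G = nx.Graph()
for syms in axioms.values():
    G.add_edges_from(itertools.combinations(syms, 2))

def split_on_separator(G, a, b, axioms):
    sep = nx.minimum_node_cut(G, a, b)            # a minimum (a, b) vertex separator
    H = G.copy()
    H.remove_nodes_from(sep)
    parts = [set(c) | sep for c in nx.connected_components(H)]   # separator kept in every part
    sub_tboxes = [{name for name, s in axioms.items() if s <= part} for part in parts]
    return sep, parts, sub_tboxes

print(split_on_separator(G, "C1", "C6", axioms))
```

Unlike the full algorithm, this sketch does not enumerate all minimal separators or enforce the size limit M; it only shows how a single separator splits the symbol graph and how axioms (cliques) map back to sub-TBoxes.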


IV. DECOMPOSITION BASED ON NORMALIZED CUT

A. Axiom graph

In this section, we propose another decomposition technique based on an axiom graph, defined as follows:

Definition 4 (axiom graph): A weighted undirected graph G = (V, E), where V is a set of vertices and E is a set of weighted edges, is called an axiom graph if each vertex v ∈ V is an axiom of the TBox T, there is an edge e = (u, v) ∈ E if u, v ∈ V share at least one symbol, and the weight of e(u, v) is a value representing the similarity between the vertices u and v.

Using only the symbols common to each pair of axioms, we can simply define a weight function p: V × V → R that maps a pair of vertices to a real number. In particular, each edge (i, j) is assigned a value wij describing the connection (association) between axioms Ai and Aj: wij = nij / (ni + nj), where i, j = 1, ..., m, i ≠ j, m is the number of axioms in T (m = |A|), ni and nj are the numbers of symbols of Ai and Aj respectively, and nij is the number of symbols shared by Ai and Aj (nij = |Ai ∩ Aj|).

B. Normalized cut

The ontology decomposition algorithm based on image segmentation is a grouping method using eigenvectors. Assume that G = (V, E) is divided into two disjoint sets A and B (A ∪ B = V and A ∩ B = ∅) by removing the edges connecting the two parts from the original graph. The association between these parts is the total weight of the removed edges; in the language of graph theory, it is called the cut:

cut(A, B) = \sum_{u \in A, v \in B} w(u, v)    (1)

i.e., the total connection between the nodes of A and the nodes of B. An optimal decomposition of the graph should not only minimize this disassociation but also maximize the association within every partition. The normalized cut (NCut) is used to measure the disassociation:

Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}    (2)

where assoc(A, V) = \sum_{u \in A, t \in V} w(u, t) is the total connection from the nodes of A to all nodes of V. Similarly, the normalized association is defined as:

Nassoc(A, B) = \frac{assoc(A, A)}{assoc(A, V)} + \frac{assoc(B, B)}{assoc(B, V)}    (3)

where assoc(A, A) and assoc(B, B) are the total weights of the edges within A and within B, respectively. The optimal division of the graph thus reduces to minimizing NCut while maximizing Nassoc over the partitions. It is easy to see that Ncut(A, B) = 2 − Nassoc(A, B). This is an important property of the decomposition: the two criteria obtained from the decomposition algorithm, minimizing the dissociation between the parts and maximizing the association within each part, are in fact identical and can be satisfied simultaneously. Unfortunately, minimizing the normalized cut exactly is NP-complete, even for the particular case of graphs on grids. However, the authors of [5] showed that if the normalized cut problem is relaxed to the real-valued domain, an approximate solution can be found efficiently.

C. Algorithm

Let x be an N-dimensional vector, N = |V|, where xi = 1 if node i is in A and xi = −1 otherwise. Let di = Σj w(i, j) be the total connection from node i to all other nodes. Let D be an N × N diagonal matrix with d on its main diagonal, and W a symmetric N × N matrix with W(i, j) = wij. The ontology decomposition algorithm based on image segmentation consists of the following steps:

1) Transform the set of axioms A into an axiom graph G = (V, E) with V = {v | v ∈ A} and E = {(u, v) | u, v ∈ V, w(u, v) = |Ex(u) ∩ Ex(v)| / |Ex(u) ∪ Ex(v)|}.

2) Find the minimum value of NCut by solving (D − W)x = λDx for the eigenvectors corresponding to the smallest eigenvalues.

3) Use the eigenvector associated with the second smallest eigenvalue to decompose the graph into two parts. In the ideal case, this eigenvector takes only two values, and the signs of the values indicate the decomposition.

4) Apply the algorithm recursively on the two resulting sub-graphs.

The TBox decomposition algorithm based on normalized cut [5] is given by the procedure DIVISION-TBOX-NC (Table 4). It takes a TBox T with the set of axioms A as input and transforms A into an axiom graph G = (V, E), where each axiom Ai of A is a vertex i ∈ V and each edge (i, j) ∈ E is assigned the weight w(i, j) = |Ex(Ai) ∩ Ex(Aj)| / |Ex(Ai) ∪ Ex(Aj)|. Then the process is performed as in the procedure DIVISION-TBOX (Table 2). DIVISION-TBOX-NC uses the procedure DIVISION-GRAPH-A (Table 5) to divide the axiom graph representing T. This procedure takes the axiom graph G as input and computes the matrices W and D: W is an N × N weight matrix with entries w(i, j) computed as above, and D is an N × N diagonal matrix with values d(i) = Σj w(i, j) on its diagonal. We then solve the equation (D − W)y = λDy under the constraints yᵀDe = 0 and


yi ∈ {2, −2b}, where e is an N × 1 vector of all ones, to find the smallest eigenvalues. The second smallest eigenvalue is chosen; it gives the minimal NCut value. We take the eigenvector corresponding to this eigenvalue to divide G into two parts G1, G2. Finally, DIVISION-GRAPH-A updates the variable Gp as in the method based on minimal separators. The procedure can be applied recursively: in each recursive call on a part Gi, it finds the eigenvector with the second smallest eigenvalue and the process continues on Gi.
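A minimal, hypothetical NumPy/SciPy sketch of the spectral bipartition step follows (it is not the authors' implementation). It builds the axiom-graph weight matrix W from shared symbols of made-up toy axioms, solves the generalized eigenproblem (D − W)y = λDy, splits on the sign of the eigenvector of the second smallest eigenvalue, and reports the NCut of the split per (2); the final line numerically checks the identity Ncut = 2 − Nassoc.

```python
# Hypothetical sketch of the normalized-cut bipartition on a toy axiom graph.
import numpy as np
from scipy.linalg import eigh

axioms = [{"C1", "C2", "X"}, {"C2", "C3", "X"}, {"C4", "X", "Y"},
          {"C5", "Y"}, {"C6", "Y", "T", "H"}]          # toy symbol sets, not a real TBox
n = len(axioms)

W = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        shared = len(axioms[i] & axioms[j])
        if shared:
            W[i, j] = W[j, i] = shared / len(axioms[i] | axioms[j])   # shared-symbol weight

D = np.diag(W.sum(axis=1))
eigvals, eigvecs = eigh(D - W, D)        # generalized eigenproblem, eigenvalues ascending
y = eigvecs[:, 1]                        # eigenvector of the second smallest eigenvalue
A = [i for i in range(n) if y[i] >= 0]
B = [i for i in range(n) if y[i] < 0]

def ncut(A, B, W):
    cut = W[np.ix_(A, B)].sum()
    return cut / W[A].sum() + cut / W[B].sum()

nassoc = W[np.ix_(A, A)].sum() / W[A].sum() + W[np.ix_(B, B)].sum() / W[B].sum()
print(A, B, round(ncut(A, B, W), 3))
print(round(ncut(A, B, W) + nassoc, 3))   # ~2.0, illustrating Ncut(A,B) = 2 - Nassoc(A,B)
```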
TABLE 4. AN ALGORITHM FOR TRANSFORMING THE TBOX INTO AXIOM GRAPH

The first algorithm minimizes the number of shared symbols (|S*| minimum) and attempts to balance the number of axioms between parts. After the decomposition, the axioms in the obtained graph are identified by cliques; however, some cliques do not actually correspond to any axiom, so a mechanism is needed to recover the axioms. The second algorithm has the advantage of preserving axioms. Through the value of NCut, we can measure the independence between parts and the dependence between the elements within each part. However, to implement an efficient algorithm, a weighting function for the edges connecting the nodes of the axiom graph must be provided. The choice of decomposition algorithm depends on the structure of the original ontology: the second algorithm suits an ontology expressed with many symbols, while the first is more suitable for an ontology that consists of many axioms.

We applied the two decomposition algorithms, based on the minimal separator and on the normalized cut, to divide a TBox. In this section, we summarize the principal modules implemented in our experiments. To illustrate the results, we take a TBox extracted from the file tambis.xml in the FaCT system. This TBox, called Tambis1, consists of 30 axioms.

- Transform the ontology into a symbol graph: this module reads a file presenting a TBox in XML and transforms it into a symbol graph. Fig. 4 shows the symbol graph of the Tambis1 TBox, with vertices labeled by concept and role names.
- Transform the ontology into an axiom graph: this module performs the same function as the above module but produces an axiom graph. Fig. 5 shows the axiom graph of the Tambis1 TBox, with vertices labeled by the symbols Ai (i = 0, ..., 29).
- Decompose the ontology based on the minimal separator: decompose the graph into a tree whose leaf nodes are axioms. Fig. 6 presents this decomposition for Tambis1.
- Decompose the ontology based on the normalized cut: decompose the graph into a tree whose leaf nodes are axioms. Fig. 7 presents this decomposition for Tambis1.

These two methods return results that satisfy the proposed properties of our decomposition. All the concepts, roles, and axioms are preserved after the decomposition, and the axioms and their relationships are well expressed by the symbol graph and the axiom graph. The set of axioms of the original TBox is distributed evenly across the sub-TBoxes.

The decomposition techniques focus on finding a good decomposition. The method based on the minimal separator minimizes the number of symbols shared between the components and tries to equalize the number of axioms in the parts. The axioms must be recovered after decomposition; this is possible because the axioms are encoded as cliques in the symbol graph. In practice, however, the difficulty is that some cliques of the symbol graph and of the intersection graph do not correspond exactly to axioms. A possible advantage of the decomposition method based on the normalized cut is that it preserves the axioms: after

Input: the TBox T with its set of axioms A.
Output: the decomposition graph Gp = (Vp, Ep) and {Tv}.
PROCEDURE DIVISION-TBOX-NC(A)
  (1) Transform the set of axioms A into an axiom graph G = (V, E) with V = {v | v ∈ A} and E = {(u, v) | u, v ∈ V, w(u, v) = |Ex(u) ∩ Ex(v)| / |Ex(u) ∪ Ex(v)|}.
  (2) Let Gp = (Vp, Ep) be an undirected graph with Vp = {{V}} and Ep = ∅.
  (3) Execute DIVISION-GRAPH-A(G = (V, E)).
  (4) For each v ∈ Vp, take Tv = {A ∈ A | A ∈ v}. Return Tv, v ∈ Vp, and Gp.

TABLE 5. AN ALGORITHM FOR DECOMPOSING AXIOM GRAPH

Input: the axiom graph G = (V, E).
Output: the decomposition graph Gp = (Vp, Ep).
PROCEDURE DIVISION-GRAPH-A(G = (V, E))
  (1) Find the minimal value of NCut by solving the equation (D − W)x = λDx for the eigenvectors with the smallest eigenvalues.
  (2) Use the eigenvector with the second smallest eigenvalue to decompose the graph into two sub-graphs G1, G2.
  (3) Let Vp ← Vp \ {{V}} ∪ {{V1}, {V2}} and Ep ← Ep ∪ {({V1}, {V2})}. The edges connecting to {V} are redirected to one of {V1}, {V2}.
  (4) After the graph is divided into two parts, the procedure can be applied recursively: DIVISION-GRAPH-A(G1), DIVISION-GRAPH-A(G2).

V. EXPERIMENT AND EVALUATION

Two graph decomposition algorithms, one based on the minimal separator and one based on image segmentation (normalized cut), have been implemented; both return results that satisfy the decomposition criteria.


decomposition, the axioms can be found directly in the components. Furthermore, the NCut measure is normalized: it expresses the dissociation between the different parts and the association within each part of the decomposition. However, the effectiveness of this method depends on choosing appropriate parameters for the similarity relation between two axioms. We tested the methods on TBoxes from the FaCT system, such as Vedaall, modkit, people, platt, and tambis. The results show that for axioms with more complex expressions the normalized cut method is much more effective (e.g., Vedaall, modkit), while the minimal separator method performs better on simple axioms (e.g., platt, tambis).

Figure 6. decomposition graph based on minimal separator of Tambis1

Figure 7. decomposition graph based on normalized cut of Tambis1

VI.

CONCLUSION

In this paper we have presented two techniques for decomposing ontologies in Description Logics (at the TBox level). Our decomposition methods aim to reduce the number of GCIs [8], one of the main factors contributing to the complexity of reasoning algorithms. The TBox separation method based on the minimal separator considers axioms only syntactically. We examine the simplest case, where concept and role atoms are treated as equivalent symbols in the axioms; in reality, however, they have different meanings. For example, the concept descriptions C ⊔ D and C ⊓ D are represented by the same symbol graph over the same symbols, although their meanings differ. We will therefore continue to develop ontology separation methods that take into account the dependence between symbols, based on the linking elements and the semantics of the axioms. We will also examine query processing over decomposed ontologies.

REFERENCES

Figure 4. Symbol graph of Tambis1


Figure 5. Axiom graph of Tambis1


[1] B. Konev, C. Lutz, D. Ponomaryov, and F. Wolter, "Decomposing Description Logic ontologies," in Procs. of KR 2010, 2010.
[2] C. Del Vescovo, D. D. G. Gessler, P. Klinov, and B. Parsia, "Decomposition and modular structure of BioPortal ontologies," in Procs. of the 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011.
[3] D. Jungnickel, Graphs, Networks and Algorithms, Springer, 1999.
[4] E. Amir and S. McIlraith, "Partition-based logical reasoning for first-order and propositional theories," Artificial Intelligence, vol. 162, February 2005, pp. 49-88.
[5] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), pp. 888-905, August 2000.
[6] K. Shoikhet and D. Geiger, "Finding optimal triangulations via minimal vertex separators," in Proceedings of the 3rd International Conference, pp. 270-281, Cambridge, MA, October 1992.
[7] T. A. L. Pham and N. Le-Thanh, "Some approaches of ontology decomposing in Description Logics," in Proceedings of the 14th ISPE International Conference on Concurrent Engineering: Research and Applications, pp. 543-542, Brazil, July 2007.
[8] T. A. L. Pham, N. Le-Thanh, and P. Sander, "Decomposition-based reasoning for large knowledge bases in Description Logics," Integrated Computer-Aided Engineering, vol. 15, no. 1, 2008, pp. 53-70.
[9] T. Kloks and D. Kratsch, "Listing all minimal separators of a graph," in Proceedings of the 11th Annual Symposium on Theoretical Aspects of Computer Science, Springer, Lecture Notes in Computer Science 775, pp. 759-768.


An Ontological Analysis of Common Research Interest for Researchers


Nawarat Kamsiang and Twittie Senivongse
Computer Science Program, Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
nawarat.k@student.chula.ac.th, twittie.s@chula.ac.th

Abstract—This paper explores a methodology and develops a tool to analyze common research interest among researchers. The analysis can be useful for researchers in establishing further collaboration, as it can help to identify the areas and degree of interest that any two researchers share. Using keywords from the publications indexed by ISI Web of Knowledge, we build ontological research profiles for the researchers. Our methodology builds upon existing approaches to ontology building and ontology matching in which the comparison between research profiles is based on name similarity and linguistic similarity between the terms in the two profiles. In addition, we add the concept of depth weights to ontology matching. A depth weight of a pair of matched terms is determined by the distance of the terms from the roots of their ontologies. We argue that more attention should be paid to a matched pair located near the bottom of the ontologies than to a matched pair near the root, since the former represents more specialized areas of interest. A comparison between our methodology and an existing ontology matching approach, using the OAEI 2011 benchmark, shows that the concept of depth weights gives better precision but lower recall.

Keywords—ontology building; ontology matching; profile matching

I.

INTRODUCTION

Internet technology has become a major tool that enriches the way people interact, express ideas, and share knowledge. Through different means such as personal Web sites, social networking applications, blogging, and discussion boards, people express their opinions, interests, and knowledge on particular matters from which a connection or relationship can be drawn. A community of practice [1] can also be formed among a group of people who share a common interest or a profession so that they can learn from and collaborate with each other. For academics and researchers, it is useful to know who does what, as well as who shares common interest, for the purpose of potential research collaboration. Different approaches have been taken to draw associations between researchers. One is the use of bibliometrics to evaluate research activities, the performance of researchers, and research specialization [2]. It is based on the enumeration and statistical analysis of scientific output such as publications, citations, and patents. The main bibliometric indicators are activity indicators and relational indicators. Activity indicators include the number of papers/patents, the number of citations, and the number of co-signers, indicating cooperation at the national and international levels. Relational indicators include, for example, co-publication, which indicates cooperation between institutions; co-citation, which indicates the impact of two papers that are cited together; and scientific links measured by citations, which trace who cites whom and who is cited by whom in order to trace the influence between different communities. Another common approach is the analysis of research profiles. Such profiles can be constructed by gathering or mining information from electronic sources such as Web sites, publications, blogs, and personal and research project documents, followed by discovering researcher expertise as well as semantic correspondences between researcher profiles. We are interested in the latter approach and use an ontology as a means to describe research profiles. The idea that we explore is building ontological research profiles and using an ontology matching algorithm to compare similarity between profiles. To build an ontological research profile, we obtain keywords from the researcher's publications indexed by ISI Web of Knowledge [3] and apply the Obtaining Domain Ontology (ODO) algorithm by An et al. [4] to build an ontology of terms related to the keywords. Terms in the profile are discovered using the WordNet lexical database [5]. To compare two research profiles, we adopt the Multi-level Matching Algorithm with the neighbor search algorithm (MLMA+) proposed by Alasoud et al. [6], [7]. The algorithm considers name similarity and linguistic similarity between terms in the profiles. In addition, we add the concept of depth weights to ontology matching: a depth weight of a pair of matched terms is determined by the distance of the terms from the roots of their ontologies. The motivation is that we would pay more attention to a similar matched pair located near the bottom of the ontologies than to a matched pair near the root, since the terms at the bottom are more specialized areas of interest. A comparison between our methodology and MLMA+ is conducted using the OAEI 2011 benchmark [8]. Section II of this paper discusses related work. Section III describes the algorithm for building ontological research profiles and a supporting tool. Section IV describes the matching of the profiles. An evaluation of the methodology is presented


in Section V, and the paper concludes in Section VI with a future outlook.

II. RELATED WORK

Many researches analyze vast pools of information to find people with particular expertise, connection between these people, and shared interest among people. Some of these apply to research and academia. Tang et al. [9] present ArnetMiner which can automatically extract researcher profiles from the Web and integrate the profiles with publication information crawled from several digital libraries. The schema of the profiles is an extension of Friend-of-a-Friend (FOAF) ontology. They model the academic network using an authorconference-topic model to support search for expertise authors, expertise papers, and expertise conferences for a given query. Zhang et al. [10] construct an expertise network from postingreplying threads in an online community. A users expertise level can be inferred by the number of replies the user has posted to help others and whom the user has helped. Punnarut and Sriharee [11] use publication and research project information from Thai conferences and research council to build ontological research profiles of researchers. They use ACM computing classification system as a basis for expertise scoring, matching, and ranking. Trigo [12] extracts researcher information from Web pages of research units and publications from the online database DBLP. Text mining is used to find terms that represent each publication and then similarity between researchers with regard to their publications is computed. For further visualization of data, clustering and social network analysis are applied. Yang et al. [13] analyze personal interest in the online profile of a researcher and metadata of publications such as keywords, conference themes, and co-authors of the papers. By measuring similarity between such researcher data, social network of a researcher is constructed. It is seen that in the approaches above, various mining techniques are used in extracting information and discovering knowledge about researchers and their relationships, and the major source of researcher information is bibliographic information in online libraries. We are interested in trying a different and more lightweight approach to finding similar interest between researchers and their degree of similarity. We focus on using an ontology building algorithm to create research profiles followed by an ontology matching algorithm to find similarity between the profiles. III. BUILDING RESEARCH PROFILES In this section and the next, we describe our methodology together with a supporting tool that has been developed. The first part of the methodology is building research profiles for researchers. Like other related work, keywords from researchers publications are used to represent research interest. A. Researcher Information We retrieve researchers publication information during ten-year period (year 2002-2011), i.e., author names, keywords, subject area, and year published, from ISI Web of Knowledge [3] and store in a MySQL database for the processing of the

Web-based tool developed in PHP. Using the tool (Fig. 1), we can specify a pair of authors, a subject area, and the year published, and the tool retrieves the corresponding keywords from the database. The tool lists the keywords by frequency of occurrence, and from the list we can select the ones that will be used for building the profiles. In the figure, we use an example of two authors named B. Kijsirikul and C. A. Ratanamahatana under the Computer Science area. The five top keywords are selected as starting terms for building their profiles.

B. Research Profile Building Algorithm

In this step, we build a research profile as an ontology. We follow the Obtaining Domain Ontology (ODO) algorithm proposed by An et al. [4], since it is intuitive and can automatically derive a domain-specific ontology from any items of descriptive information (i.e., keywords, in this case). The general idea is to augment the starting keywords with terms and hypernym (i.e., parent) relations from WordNet [5] to construct ontology fragments as directed acyclic graphs. The iterative process of weaving WordNet terms and joining the terms together ties the ontology fragments into one ontology representing research interest. Fig. 2 and Fig. 3 are Kijsirikul's and Ratanamahatana's profiles built from their top five keywords. The steps of the algorithm, and the enhancements we make to tailor it to ISI keywords, are as follows.

1) Select starting keywords: Select keywords as starting terms. For Kijsirikul, they are Dimensionality reduction, Semi-supervised learning, Transductive learning, Spectral methods, and Manifold learning. For Ratanamahatana, they are Kolmogorov complexity, dynamic time warping, parameter-free data mining, anomaly detection, and clustering.

Figure 1. Specifying authors for profile building.


Figure 2. Kijsirikuls ontological profile.

sentences to discover hypernym relations, but here we consider the pattern of the term. That is, if the term is a noun phrase consisting of a head noun and modifier(s), generalize the term by removing one modifier at a time and look up in WordNet. If found, use that generalized form as the hypernym. For example, in Fig. 2, the term Dimensionality reduction has reduction as the head noun and Dimensionality as the modifier, removing the modifier leaves us with the head noun reduction which can be found in WordNet, so reduction becomes the parent. In Fig. 3, the term parameter-free data mining has mining as the head noun, and parameter-free and data as modifiers. Removing parameter-free leaves us with the more generalized term data mining which can be found in WordNet and hence it becomes the parent. In the case that none of the generalized forms of the term are in WordNet, use the subject area as the hypernym. Some ISI keywords comprise a main term and an acronym in different formats, e.g., finite element method (FEM) or PTPC (percutaneous transhepatic portal catheterization). We consider the main term and apply the technique above. Therefore, the parent of finite element method (FEM) is method and the parent of PTPC (percutaneous transhepatic portal catheterization) is catheterization. 4) Build up ontology: Several parent-child relations that result from finding hypernyms become ontology fragments. Repeat steps 2) and 3) to further interweave hypernym terms until no more hypernyms can be found. 5) Merge ontology fragments: The final step is to merge the ontology fragments. If a term is found in two ontology fragments, the fragments are joined. At a joined node, if there are several upward paths from the node to the roots (from different ontology fragments), we pick the shortest path for simplicity. In Fig. 2, five ontology fragments, each with a starting keyword as the terminal node, can merge at learning, knowledge, and psychological feature nodes respectively. Since merging at psychological feature results in one single ontology, the parents of psychological feature (i.e., abstraction -> entity) are dropped. Another example that will be discussed in the next section is the profile of an author named A. Sudsang under Robotics area (Fig. 4). Five starting keywords are Grasping, grasp heuristic, Caging, positive span, and capture regions.

Figure 3. Ratanamahatana's ontological profile.

2) Find hypernyms in WordNet: For each term, look it up in WordNet for its hypernyms. Since a term may have several hypernyms, for simplicity, we select one with the maximum tag count which denote that the hypernym of a particular sense (or meaning) is most frequently used and tagged in various semantic concordance texts. In Fig. 3, the starting term clustering has the term agglomeration as its hypernym. If the term does not exist in WordNet but may be in a plural form (i.e., it ends with ches, shes, sses, ies, ses, xes, zes, or s), change to a singular form before looking up for hypernyms again. It is possible that one starting keyword may be found to be a hypernym of another. It is also possible that no hypernym is found for the term. If so, follow step 3). 3) Define hypernyms: If the term does not exist in WordNet, do any of the following. a) Use subject area as hypernym: If the term is a single word or an acronym, use the subject area of the author as its hypernym. Some ISI subject areas contain &, so in this case the words before and after & become hypernyms. For example, if the subject area is Science & Technology, Science and Technology become two parents of the term. b) Use the generalized form of the term as hypernym: This is in accordance with the lexico-syntactic pattern technique in [14] which considers syntactic patterns of

Figure 4. Sudsang's ontological profile.


IV. MATCHING RESEARCH PROFILES

In this step, we match two ontological profiles. We adopt an effective algorithm called Multi-level Matching Algorithm with the neighbor search algorithm (MLMA+) proposed by Alasoud et al. [6], [7] since it uses different similarity measures to determine similarity between terms in the ontologies and also considers matching n terms in one ontology with m terms in another at the same time. A. MLMA+ The original MLMA+ algorithm for ontology matching is shown in Fig. 5. It has three phases. 1) Initialization phase: First, preliminary matching techniques are applied to determine similarity between terms in the two ontologies, S and T. Similarity measures that are used are name similarity (Levenshtein distance) and linguistic similarity (WordNet). Levenshtein distance determines the minimal number of insertions, deletions, and substitutions to make two strings equal [15]. For linguistic similarity, we determine semantic similarity between a pair of terms using a Perl module in WordNet::Similarity package [16]. Given Kijsirikuls ontology S comprising n terms and Ratanamahatanas ontology T comprising m terms, we compute a similarity matrix L(i, j) of size n x m which includes values in the range [0,1] called similarity coefficients, denoting the degree of similarity between the terms si in S and tj in T. A similarity coefficient is computed as an average of name similarity and linguistic similarity. For example, if Levenshtein distance between the terms s10 (change) and t23 (damage) is 0.2 and semantic similarity is 0.933, the similarity coefficient of these two terms is 0.567. The similarity matrix L for Kijsirikul and Ratanamahatana is shown in Fig. 6. Then, a user-defined threshold th is applied to the matrix L to create a binary matrix Map0-1. The similarity coefficient that is less than the threshold becomes 0 in Map0-1, otherwise it is 1. In other words, the threshold determines which pairs of terms are considered similar or matched by the user. Fig. 6 also shows Map0-1 for Kijsirikul and Ratanamahatana with th = 0.5. It represents the state that s2 is matched to t14, s10 is matched to t14 and t23 etc. This Map0-1 becomes the initial state St0 for the neighbor search algorithm. 2) Neighbor search and evaluation phases: In this step, we search in the neighborhood of the initial state St0. Each neighbor Stn is computed by toggling a bit of St0, so the total number of neighbor states is n*m. An example of a neighbor state is in Fig. 7. The initial state and all neighbor states are evaluated using the matching score function v (1) of [6], [7]:
v(Map0-1, L) = (1/k) Σ_{i=1..n} Σ_{j=1..m} Map0-1(i, j) · L(i, j),   where k = Σ_{i=1..n} Σ_{j=1..m} Map0-1(i, j) and v ≥ th    (1)

Algorithm Match (S, T)
begin
  /* Initialization phase */
  K ← 0;
  St0 ← preliminary_matching_techniques(S, T);
  Stf ← St0;
  /* Neighbor Search phase */
  St ← All_Neighbors(Stn);
  While (K++ < Max_iteration) do
    /* Evaluation phase */
    If score(Stn) > score(Stf) then
      Stf ← Stn;
    end if
    Pick the next neighbor Stn ∈ St;
    St ← St − Stn;
    If St = ∅ then return Stf;
  end
  Return Stf;
end

Figure 5. Ontology matching algorithm MLMA+ [6], [7].

Figure 6. L and initial Map0-1 based on MLMA+.

Figure 7. Example of a neighbor state of initial Map0-1 in Fig. 6.


where k is the number of matched pairs and Map0-1 is Stn . The state with the maximum score is the answer to the matching; it indicates which terms in S and T are matched and the score represents the degree of similarity between S and T.
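A minimal sketch of the evaluation phase is given below, assuming the similarity matrix L and the binary state Map0-1 are available as plain arrays; the class and method names are ours, and the iteration control of Fig. 5 (Max_iteration, picking successive neighbours) is reduced here to a single sweep over the n*m one-bit neighbours.

```java
// Minimal sketch of the matching-score evaluation and one-bit-flip neighbour
// search described for MLMA+; l is the similarity matrix, map the current
// binary match state. Names are ours, not from [6], [7].
public class MlmaPlusSketch {

    // v(Map, L): average similarity coefficient over the matched pairs
    static double score(int[][] map, double[][] l) {
        double sum = 0;
        int k = 0;
        for (int i = 0; i < map.length; i++)
            for (int j = 0; j < map[i].length; j++)
                if (map[i][j] == 1) { sum += l[i][j]; k++; }
        return k == 0 ? 0 : sum / k;
    }

    // evaluate all n*m neighbours of the initial state (one toggled bit each)
    // and keep the best-scoring state
    static int[][] bestNeighbourSearch(int[][] initial, double[][] l) {
        int n = initial.length, m = initial[0].length;
        int[][] best = copy(initial);
        double bestScore = score(initial, l);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                int[][] neighbour = copy(initial);
                neighbour[i][j] = 1 - neighbour[i][j];   // toggle one bit
                double s = score(neighbour, l);
                if (s > bestScore) { bestScore = s; best = neighbour; }
            }
        }
        return best;
    }

    static int[][] copy(int[][] a) {
        int[][] c = new int[a.length][];
        for (int i = 0; i < a.length; i++) c[i] = a[i].clone();
        return c;
    }

    public static void main(String[] args) {
        double[][] l = {{0.9, 0.2}, {0.1, 0.8}};
        int[][] map0 = {{0, 0}, {0, 1}};
        System.out.println(score(bestNeighbourSearch(map0, l), l)); // 0.85
    }
}
```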

B. Modification to MLMA+ We make a change to the initialization phase of MLMA+ by adding the concept of depth weights which is inspired by [17]. A depth weight of a pair of matched terms is determined by the distance of the terms from the root of their ontologies. The motivation behind this is that we would pay more attention to a similar matched pair that are located near the bottom of the ontologies than to the matched pair that are near the root, since the terms near the bottom are considered more specialized areas of interest. From Fig. 6, consider s2 = event and t14 = occurrence. The two terms have similarity coefficient = 0.51. They are relatively more generalized terms in the profiles compared to the pair s10 = change and t23 = damage with similarity coefficient = 0.567. But both pairs are equally considered as matched interest. We are in favor of the matched pairs that are relatively more specialized and are motivated to decrease the degree of similarity between generalized matched pairs by using a depth weight function w (2):


wij = (rdepth(si) + rdepth(tj)) / 2,   wij ∈ (0, 1]    (2)

where rdepth(t) = relative distance of the term t from the root of its ontology = depth of the term t in its ontology / height of the ontology.
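The modification amounts to a small computation, sketched below in Java under the assumption that term depths and ontology heights are known; the numbers in the usage example correspond to the worked example discussed next.

```java
// Sketch of the depth-weight modification: each similarity coefficient is
// multiplied by w = (rdepth(s) + rdepth(t)) / 2 before thresholding.
// Depths and ontology heights are assumed to be supplied by the caller.
public class DepthWeights {

    static double rdepth(int depth, int ontologyHeight) {
        return (double) depth / ontologyHeight;
    }

    static double weighted(double similarity,
                           int depthS, int heightS,
                           int depthT, int heightT) {
        double w = (rdepth(depthS, heightS) + rdepth(depthT, heightT)) / 2.0;
        return w * similarity;
    }

    public static void main(String[] args) {
        // the pairs discussed in the text: (event, occurrence) and (change, damage)
        System.out.println(weighted(0.51, 2, 8, 5, 10));   // ~0.191
        System.out.println(weighted(0.567, 5, 8, 7, 10));  // ~0.376
    }
}
```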

This depth weight will be multiplied with the similarity coefficient between si and tj to obtain a weighted similarity coefficient. Therefore the similarity matrix L(i, j) would change to include weighted similarity coefficients between the terms si and tj instead. For s2 = event and t14 = occurrence in Fig. 2 and Fig. 3, rdepth(s2) = 2/8 and rdepth(t14) = 5/10. Their depth weight w would be 0.375 and hence their weighted similarity coefficient would change from 0.51 to 0.191 (0.375*0.51). But for s10 = change and t23 = damage, rdepth(s10) = 5/8 and rdepth(t23) = 7/10. Their depth weight w would be 0.663 and hence their weighted similarity coefficient would change from 0.567 to 0.376 (0.663*0.567). It is seen that the more generalized the matched terms, the more they are penalized by the depth weight. Any matched terms that are both the terminal node of the ontology would not be penalized (i.e., w =1). Fig. 8 shows the new similarity matrix L, with weighted similarity coefficients, and the new initial Map0-1 for Kijsirikul and Ratanamahatana where th = 0.35. Note that for the pair s2 = event and t14 = occurrence, and the pair s10 = change and t14 = occurrence they are considered matched in Fig. 6 but are relatively too generalized and considered unmatched in Fig. 7. For s10 = change and t23 = damage, they survive the penalty and are considered matched in both figures. C. Matching Results of Example Table I shows matching results of the example when the original MLMA+ and its modification are used. Both algorithms agree that Kijsirikuls profile (Machine Learning) is more similar to Ratanamahatanas (Data Mining) than Sudsangs (Robotics). Matched pairs between Kijsirikul and Ratanamahatana are listed in Table II. MLMA+ gives a big list of matched pairs including those very generalized terms, while depth weights filter some out, giving a more useful list.

TABLE II. MATCHED PAIRS

Algorithm: MLMA+
Matched pairs: (psychological feature, psychological feature), (event, event), (event, occurrence), (event, change), (knowledge, process), (power, process), (power, event), (power, quality), (process, process), (process, processing), (act, process), (act, event), (act, change), (action, process), (action, change), (action, detection), (basic cognitive process, basic cognitive process), (change, event), (change, occurrence), (change, change), (change, damage), (change, deformation), (change of magnitude, change), (reduction, change), (knowledge, perception)

Algorithm: MLMA+ with depth weights
Matched pairs: (basic cognitive process, basic cognitive process), (change, change), (change, damage), (change, deformation), (change, warping), (reduction, change), (reduction, detection), (reduction, damage), (reduction, deformation), (change of magnitude, deformation)

V. EVALUATION AND DISCUSSION

Our ontology matching algorithm is evaluated using OAEI 2011 benchmark test sample suite [8]. The benchmark provides a number of test sets in a bibliographic domain, each comprising a test ontology in OWL language and a reference alignment. Each test ontology is a modification to the reference ontology #101 and is to be aligned with the reference ontology. Each reference alignment lists expected alignments. So in the test set #101, the reference ontology is matched to itself, and in the test set #n, the test ontology #n is matched to the reference ontology. The quality indicators we use are precision (3), recall (4), and F-measure (5).
Precision = (no. of expected alignments found as matched by the algorithm) / (no. of matched pairs found by the algorithm)    (3)

Recall = (no. of expected alignments found as matched by the algorithm) / (no. of expected alignments)    (4)

F-measure = (2 × Precision × Recall) / (Precision + Recall)    (5)
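As a small illustration, the three indicators can be computed directly from the set of pairs found by the algorithm and the set of expected alignments; the Java fragment below (names ours) encodes alignments as plain strings.

```java
import java.util.*;

// Small helper matching equations (3)-(5): quality of a discovered alignment
// against the expected (reference) alignment.
public class AlignmentQuality {

    static double[] evaluate(Set<String> found, Set<String> expected) {
        Set<String> correct = new HashSet<>(found);
        correct.retainAll(expected);               // expected alignments found by the algorithm
        double precision = found.isEmpty() ? 0 : (double) correct.size() / found.size();
        double recall = expected.isEmpty() ? 0 : (double) correct.size() / expected.size();
        double f = (precision + recall) == 0 ? 0 : 2 * precision * recall / (precision + recall);
        return new double[]{precision, recall, f};
    }

    public static void main(String[] args) {
        Set<String> found = new HashSet<>(Arrays.asList("a=a", "b=c", "d=d"));
        Set<String> expected = new HashSet<>(Arrays.asList("a=a", "d=d", "e=e"));
        System.out.println(Arrays.toString(evaluate(found, expected))); // [0.666..., 0.666..., 0.666...]
    }
}
```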

Figure 8. L and initial Map0-1 based on MLMA+ with depth weights.

TABLE I. MATCHING SCORES

Algorithm                | Author 1   | Author 2       | Matching Score
MLMA+                    | Kijsirikul | Ratanamahatana | 0.627
MLMA+                    | Kijsirikul | Sudsang        | 0.581
MLMA+ with depth weights | Kijsirikul | Ratanamahatana | 0.411
MLMA+ with depth weights | Kijsirikul | Sudsang        | 0.372

Table III shows the evaluation results with th = 0.5. We group the test sets into four groups. Test set #101-104 contain test ontologies that are more generalized or restricted than the reference ontology by removing or replacing OWL constructs that make the concepts in the reference ontology generalized or restricted. Test set #221-247 contain test ontologies with structural change such as no specialization, flattened hierarchy, expanded hierarchy, no instance, no properties. The quality of both algorithms with respect to these two groups is quite similar since these modifications do not affect string-based and linguistic similarities which are the basis of both algorithms. Test set #201-210 contain test ontologies which relate to change of names in the reference ontology, such as by renaming with random strings, misspelling, synonyms, using certain naming convention, and translation into a foreign language. Both algorithms are more sensitive to this test set. Test set #301-304 contain test ontologies which are actual bibliographic ontologies. According to an average F-measure, MLMA+ with depth weights is about the same quality as MLMA+ as it gives better precision but lower recall. MLMA+ discovers a large number


of matched pairs whereas depth weights can decrease this number and hence precision is higher. But at the same time, recall is affected. This is because the reference alignments only list pairs of terms that are expected to match. That is, for example, if the test ontology and the reference ontology contain the same term, the algorithm should be able to discover a match. But MLMA+ with depth weights considers the presence of the terms in the ontologies as well as their location in the ontologies. So an expected alignment in a reference alignment may be considered unmatched if the terms are near the root of the ontologies and are penalized by the algorithm. The user-defined threshold th in the initialization phase of MLMA+ is a parameter that affects precision and recall. If th is too high, only identical terms from the two ontologies would be considered as matched pairs (e.g., (psychological feature, psychological feature)), and these identical pairs are mostly located near the root of the ontologies. We see that discovering only identical matched pairs is not very interesting, given that the benefit of using WordNet and linguistic similarity between non-identical terms is not present in the matching result. On the contrary, if th is too low, there would be a proliferation of matched pairs because, even if a matched pair is penalized by its depth weight, its weighted similarity coefficient remains greater than the low th. The value of th that we use for the data set in the experiment trades off these two aspects; it is the highest threshold for which the matching result contains both identical and non-identical matched pairs. The complexity of the ODO algorithm for building an ontology S depends on the number of terms in S and the size of the search space when joining any identical terms in S into single nodes, i.e., O(2^n), where the number of ontology terms n = number of starting keywords × depth of S, given that, in the worst case, all starting keywords have the same depth. For MLMA+ and MLMA+ with depth weights, the complexity depends on the size of the search space when matching two ontologies S and T, i.e., O((n*m)^2), where n and m are the sizes of S and T respectively.

VI. CONCLUSION

For future work, further evaluation using a larger corpus and evaluation of the performance of the algorithms are expected. An experience report on practical use of the methodology will be presented. It is also possible to adjust the ontology matching step so that the structure of the ontologies and the context of the terms are considered. In addition, we expect to explore whether the methodology can be useful for discovering potential cross-field collaboration.

REFERENCES
[1] A. Cox, What are communities of practice? A comparative review of four seminal works, J. of Information Science, vol. 31, no. 6, pp. 527-540, December 2005.
[2] Y. Okubo, Bibliometric Indicators and Analysis of Research Systems: Methods and Examples, Paris: OECD Publishing, 1997.
[3] ISI Web of Knowledge, http://www.isiknowledge.com, Last accessed: January 24, 2012.
[4] Y. J. An, J. Geller, Y. Wu, and S. A. Chun, Automatic generation of ontology from the deep Web, in Procs. of 18th Int. Workshop on Database and Expert Systems Applications (DEXA07), 2007, pp. 470-474.
[5] WordNet, http://wordnet.princeton.edu/, Last accessed: January 24, 2012.
[6] A. Alasoud, V. Haarslev, and N. Shiri, An empirical comparison of ontology matching techniques, J. of Information Science, vol. 35, pp. 379-397, March 2009.
[7] A. Alasoud, V. Haarslev, and N. Shiri, An effective ontology matching technique, in Procs. of 17th Int. Conf. on Foundations of Intelligent Systems, 2008, pp. 585-590.
[8] Ontology Alignment Evaluation Initiative 2011 Campaign, http://oaei.ontologymatching.org/2011/, Last accessed: January 24, 2012.
[9] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, ArnetMiner: Extraction and mining of academic social networks, in Procs. of 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2008), 2008, pp. 990-998.
[10] J. Zhang, M. Ackerman, and L. Adamic, Expertise network in online communities: structure and algorithms, in Procs. of 16th Int. World Wide Web Conf. (WWW 2007), 2007, pp. 221-230.
[11] R. Punnarut and G. Sriharee, A researcher expertise search system using ontology-based data mining, in Procs. of 7th Asia-Pacific Conference on Conceptual Modelling (APCCM 2010), 2010, pp. 71-78.
[12] L. Trigo, Studying researcher communities using text mining on online bibliographic databases, in Procs. of 15th Portuguese Conf. on Artificial Intelligence, 2011, pp. 845-857.
[13] Y. Yang, C. A. Yueng, M. J. Weal, and H. C. Davis, The researcher social network: A social network based on metadata of scientific publications, in Procs. of Web Science Conf. 2009 (WebSci 2009), 2009.
[14] M. A. Hearst, Automated discovery of WordNet relations, in WordNet: An Electronic Lexical Database and Some of its Applications, Cambridge, MA: MIT Press, 1998, pp. 132-152.
[15] G. Navarro, A guided tour to approximate string matching, ACM Computing Surveys, vol. 33, pp. 31-88, March 2001.
[16] WordNet::Similarity, http://sourceforge.net/projects/wn-similarity, Last accessed: January 24, 2012.
[17] H. Yang, S. Liu, P. Fu, H. Qin, and L. Gu, A semantic distance measure for matching Web services, in Procs. of Int. Conf. on Computational Intelligence and Software Engineering (CiSE), 2009, pp. 1-3.

This work presents an ontology-based methodology and a supporting Web-based tool for (1) building research profiles from ISI keywords and WordNet terms by applying the ODO algorithm, and (2) finding similarity between the profiles using MLMA+ with depth weights. An evaluation using the OAEI 2011 benchmark shows that depth weights can give good precision but lower recall.
TABLE III. EVALUATION RESULTS

            MLMA+                        MLMA+ with Depth Weights
Test Set    Prec.   Rec.   F-measure     Prec.   Rec.   F-measure
#101-104    0.74    1.0    0.85          0.93    0.84   0.88
#201-210    0.35    0.24   0.26          0.68    0.18   0.27
#221-247    0.71    0.99   0.82          0.94    0.66   0.75
#301-304    0.56    0.75   0.64          0.90    0.57   0.68
Average     0.59    0.74   0.64          0.86    0.56   0.64


Automated Software Development Methodology: An agent oriented approach


Sudipta Acharya Dept. of Information technology National Institute of Technology Durgapur, India sonaacharya.2009@gmail.com Prajna Devi Upadhyay Dept. of Information technology National Institute of Technology Durgapur, India kirtu26@gmail.com Animesh Dutta Dept. of Information technology National Institute of Technology Durgapur, India animeshrec@gmail.com

Abstract In this paper, we propose an automated software development methodology. The methodology is conceptualized with the notion of agents, which are autonomous goal-driven software entities. They coordinate and cooperate with each other, like humans in a society to achieve some goals by performing a set of tasks in the system. Initially, the requirements of the newly proposed system are captured from stakeholders which are then analyzed in goal oriented model. Finally, the requirements are specified in the form of goal graph, which is the input to the automated system. Then this automated system generates MAS (Multi Agent System) architecture and coordination of the agent society to satisfy the set of requirements by consulting with the domain ontology of the system. Keywords-Agent; Multi Agent System;Agent Oriented Software Engineering; Domain Ontology; MAS Architecture; MAS Coordination; Goal Graph.

Recently, transformation systems based on formal models to support agent system synthesis are emerging fields of research. There are currently few AOSE methodologies for multi agent systems, and many of those are still under development. II. RELATED WORK

I. INTRODUCTION

A. Agent and Multi agent system An agent[1, 2] is a computer system or software that can act autonomously in any environment, makes its own decisions about what activities to do, when to do, what type of information should be communicated and to whom, and how to assimilate the information received. Multi-agent systems (MAS) [1, 2] are computational systems in which two or more agents interact or work together to perform a set of tasks or to satisfy a set of goals. B. Agent Oriented Software Engineering The advancement from assembly level programming to procedures and functions and finally to objects has taken place to model computing in a way we interpret the world. But there are inherent limitations in an object that makes it incapable of modeling a real world entity. It was for this reason that we move to agents and Multi agent systems, which model a real world entity in a better way. As agent technology has become more accepted, agent oriented software engineering (AOSE) also has become an important topic for software developers who wish to develop reliable and robust agent-based software systems [3, 4, 5]. Methodologies for AOSE attempt to provide a method for engineering practical multi agent systems.

Recent work has focused on applying formal methods to develop a transformation system to support agent system synthesis. Formal transformation systems [6, 7, 8] provide automated support to system development, giving the designer increased confidence that the resulting system will operate correctly, despite its complexity. In [9] authors have proposed a Goal oriented language GRL and a scenarios oriented architectural notation UCM to help visualize the incremental refinement of architecture from initially abstract description. But the methodology proposed is informal and due to this the architecture will vary from developer to developer. In [10, 11] a methodology for multi agent system development based on goal model is proposed. Here, MADE (Multi Agent Development Environment) tool has been developed to reduce the gap between design and implementation. The tool takes the agent design as input and generates the code for implementation. The agent design has to be provided manually. Automation has not been shown for generation of design from user requirements. A procedure to map the requirements to agent architecture is proposed in [12]. The TROPOS methodology for building agent oriented software system is introduced in [13]. But the methodologies proposed in both [12] and [13] are informal approaches. III. SCOPE OF WORK

There are very few AOSE methodologies for automated design of the system from user requirements. But, most of the work follows an informal approach due to which the system design may not totally satisfy the user requirements. Also the system design varies from developer to developer. There is a need to reduce the gap between the requirements specification and agent design and to develop a standard methodology which can generate the design from user requirements irrespective of the developers. In this work we have concentrated on developing a standard methodology by which


we can generate the design of software from user requirements which will be developer independent. In this paper, we develop an automated system which takes the user requirements as input and generates the MAS architecture and coordination with the help of domain knowledge. The basic requirements are analyzed in a goal oriented fashion [14] and represented in the form of goal graph while the domain knowledge is represented with the help of ontology [15]. The output of the developed system is MAS architecture which consists of a number of agents and their capabilities and MAS coordination represented through Task Petri Nets. The Task Petri Nets tool can model the coordination among the agents to maintain the inherent dependencies between the tasks. IV. PROPOSED METHODOLOGY

have been used in Software Engineering to model requirements and non-functional requirements for a software system. Formally, we can define a Goal Graph as G = (V, E), consisting of:
A set of nodes V = {V1, V2, ..., Vn}, where each Vi is a goal to be achieved in the system, 1 <= i <= n.
A set of edges E. There are two types of edges, subgoal edges and happened-before edges.
A function subgoal: (V × V) → Bool; subgoal(Vi, Vj) = true if Vj is an immediate sub-goal of Vi.
A function hb: (V × V) → Bool; hb(Vi, Vj) = true if the user specifies that the goal represented by Vi should be satisfied before the goal represented by Vj is satisfied.
A subgoal edge exists between two vertices Vi and Vj if subgoal(Vi, Vj) = true, Vi, Vj ∈ V. A happened-before edge exists between two vertices Vi and Vj if hb(Vi, Vj) = true, Vi, Vj ∈ V.
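A minimal data-structure sketch of such a goal graph is shown below in Java; the class is our own illustration (the goal names in the usage example are taken from the library case study of Section V) and is not part of the proposed tool.

```java
import java.util.*;

// Minimal sketch of the goal graph G = (V, E) with the two edge types used in
// the paper: subgoal(Vi, Vj) and happened-before hb(Vi, Vj).
public class GoalGraph {

    private final Set<String> goals = new LinkedHashSet<>();
    private final Map<String, Set<String>> subgoals = new HashMap<>();  // parent -> immediate sub-goals
    private final Map<String, Set<String>> hb = new HashMap<>();        // Vi must be satisfied before Vj

    public void addGoal(String goal) { goals.add(goal); }

    public void addSubgoal(String parent, String child) {
        addGoal(parent); addGoal(child);
        subgoals.computeIfAbsent(parent, k -> new LinkedHashSet<>()).add(child);
    }

    public void addHappenedBefore(String before, String after) {
        addGoal(before); addGoal(after);
        hb.computeIfAbsent(before, k -> new LinkedHashSet<>()).add(after);
    }

    // leaf-node sub-goals are the ones handed to the automated system
    public Set<String> leafGoals() {
        Set<String> leaves = new LinkedHashSet<>(goals);
        leaves.removeAll(subgoals.keySet());
        return leaves;
    }

    public static void main(String[] args) {
        GoalGraph g = new GoalGraph();
        g.addSubgoal("Delete member and book from database", "Check the membership validity");
        g.addSubgoal("Delete member and book from database", "Remove book from library");
        g.addHappenedBefore("Check the membership validity", "Remove book from library");
        System.out.println(g.leafGoals());
    }
}
```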

Fig. 1 represents the architecture of our proposed automated system. The basic requirements are taken from the user as input. Since Requirements Analysis is an informal process, the input requirements can be captured from the user in the form of a text file or any other desirable format. These requirements are further analyzed and represented in the form of a Goal Graph. The domain knowledge is also an input and is represented in the form of ontology. The automated system returns the MAS Architecture and MAS Coordination as output. The MAS Coordination is represented in the form of Task Petri Nets. Thus, the automated system takes the requirements and the domain knowledge as input and generates the MAS Architecture and MAS Coordination as output. So, we can say,

B. Domain Knowledge represented by Ontology A domain ontology [15] defined on a domain M is a tuple O = (TCO, TRO, I, conf, <, B, IR), where we have extended the tuple definition by adding another function IR as per our requirements. TCO= {c1, c2,,ca}, is the set of concept types defined in domain M. Here, TCO= {task, goal}. In diagram, a concept is represented by TRO= {consists_of, happened_before}. In diagram, a relation is represented by I is the set of instances of TCO, from the domain M. conf: ITCO, it associates each instance to a unique concept type. : (TC X TC) (TR X TR) {true, false}, <(c,d)=true indicates c is a subtype of d B: TR -> (TC) where TR B(r) = {c1,..,cn}, where n is a variable associated with r. The set {c1,..,cn} is an ordered set and could contain duplicate elements. Each ci is called an argument of r. The number of elements of the set is called the arity (or valence) of the relation. B(consists)={goal,,goal, task,,task} IR: TRO(I), where ai (I), if ai is the ith element of (I), then conf(ai) = ti, where ti is the ith element of B(TRO).
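For illustration only, the parts of the ontology that the later steps rely on (instances of the consists_of and happened_before relations) can be held in a structure as simple as the following Java sketch; the names are ours, and this is not the tool's actual data model.

```java
import java.util.*;

// Very small stand-in for the domain ontology O: a consists_of relation from a
// set of goal-describing concepts to the TASK instances that achieve the
// sub-goal, plus happened_before dependencies between tasks.
public class DomainOntology {

    public static final class ConsistsOf {
        final Set<String> goalConcepts;   // e.g. {"Check", "Validity", "Member"}
        final List<String> tasks;         // tasks that achieve the described sub-goal
        public ConsistsOf(Set<String> goalConcepts, List<String> tasks) {
            this.goalConcepts = goalConcepts;
            this.tasks = tasks;
        }
    }

    final List<ConsistsOf> consistsOf = new ArrayList<>();
    final Map<String, Set<String>> happenedBefore = new HashMap<>(); // task -> tasks that must wait for it

    public void addConsistsOf(Set<String> goalConcepts, List<String> tasks) {
        consistsOf.add(new ConsistsOf(goalConcepts, tasks));
    }

    public void addHappenedBefore(String first, String later) {
        happenedBefore.computeIfAbsent(first, k -> new HashSet<>()).add(later);
    }

    public static void main(String[] args) {
        DomainOntology o = new DomainOntology();
        o.addConsistsOf(new HashSet<>(Arrays.asList("Check", "Validity", "Member")),
                Arrays.asList("Get library identity card of member", "Check for validity of that ID card"));
        o.addHappenedBefore("Get library identity card of member", "Check for validity of that ID card");
        System.out.println(o.consistsOf.size() + " consists_of relation(s)");
    }
}
```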

Figure 1. Architecture of the proposed automated system

MAS Architecture= f (Requirements, Domain ontology) MAS Coordination=f (Requirements, Domain ontology, MAS architecture) The architecture of the proposed automated system is described in the following sub-section. A. Requirements represented by Goal Graph Agents in MAS are goal oriented i. e. all agents perform collaboratively to achieve some goal. The concept of goal has been used in many areas of Computer Science for quite some time. In AI, goals have been used in planning to describe desirable states of the world since the 60s. More recently, goals

C. Semantic Mapping from requirements to Ontology Concepts The process by which the basic keywords of the leaf node sub goal are mapped into concepts in the Ontology is called Semantic Mapping. In this paper, the aim of semantic mapping is to find out tasks from Domain Ontology, required to be performed to achieve a sub-goal given as input from the Goal Graph. Let there be a set of task concepts T={t1,t2...tn} associated with a consists_of relation in an ontology. Let there


be set of goal concepts G={g1,g2...gm} also associated with that consists_of relation. Now let from user side requirements come of which after Requirements Analysis goal graph consists set of leaf node sub goals, G0={G1,G2,G3....Gp}. Let ky be a function that maps a sub-goal to its set of keywords. The set of keywords for sub goal Gi G0 can be represented by ky(Gi)={ky1,ky2...kyd}. Now the set of tasks T will be performed to achieve sub goal Gi iff either the mapping f : ky(Gi)G is a bijective mapping where Gi G0 , or there exists a subset of GO, {Gi, Gj ,, Gk} GO, such that f : ky(Gi) U ky(Gj) U....U ky(Gk) G is a bijective mapping. D. MAS Architecture MAS architecture consists of set of agents with their capability sets i.e set of tasks that an agent can perform. Formally we can define agent architecture as, < AgentID, {capability set}> where AgentID is unique identification number of agent, and capability set is set of tasks {t1, t2,......,tn} that the corresponding agent is able to perform. MAS architecture can be defined as a set of agents with their corresponding architectures. E. MAS Coordination represented by Task Petri Nets A Task Petri Nets is an extended Petri Nets tool that can model the MAS coordination. It is a six tuple, TPN = (P, TR, I, O, TOK, Fn) where P is a finite set of places. There are 8 types of places, P= Pt Ph Pc Pe Pf Pr Pa Pd. Places Ph, Pc, Pe, Pf exist for each task already identified by the interface agent. The description of the different types of places is: Ph: A token in this place indicates that the task represented by this place can run, i.e. all the tasks that were required to be completed for this task to run are completed. Pc: A token in this place indicates that an agent has been assigned for this task. Pe: A token in this place indicates agent and resources have been allocated for the task represented by the place and the task is under execution by the allocated agent. Pf: A token in this place indicates that the task represented by this place has finished execution. Pr: such a place exists for each type of a resource in the system , ri Pri ri R, 1iq Pa: such a place exists for each instance of an agent in the system , ai Pai ai A, 1ip Pt: it is the place where the tasks identified by the interface agent initially reside.

Pd: such a place is created dynamically after the agent has been assigned to the task and the agent decides to divide the task into subtasks. For each subtask, a new place is created.

TR is the set of transitions. There are 5 types of transitions, TR = th ∪ tc ∪ te ∪ tf ∪ td, where th, te, tf exist for every task identified by the interface agent:
th: This transition fires if the task it represents is enabled, i.e., all the tasks which should be completed for the task to start are complete.
tc: This transition fires if the task it represents is assigned an agent which is capable of performing it.
te: This transition fires if all resources required by the task it represents are allocated to it.
tf: This transition fires if the task represented by the transition is complete.
td: This transition is dynamically created when the agent assigned to the task it represents decides to split the task further into sub-tasks. The subnet that is formed dynamically consists of places and transitions all of which are categorized as Pd or td respectively.

I is the set of input arcs, which are of the following types:
I1 = Pt × th: task checked for dependency
I2 = Pr × te: request for resources
I3 = Pe × tf: task completed
I4 = Pf × th: interrupt to successor task
I5 = Pc × td ∪ Pa × td ∪ Pr × td ∪ Pd × tf: input arcs of the subnet formed dynamically

O is the set of output arcs, which are of the following types:
O1 = th × Ph: task not dependent on any other task
O2 = tc × Pc: agent assigned
O3 = te × Pe: resource allocated
O4 = tf × Pr: resource released
O5 = tf × Pf: task completed by agent
O6 = tf × Pa: agent released
O7 = td × Pd: output arcs of the subnet formed dynamically

TOK is the set of color tokens present in the system, TOK = {TOK1, TOK2, ..., TOKx}, where each TOKi, 1 ≤ i ≤ x, is associated with a function assi_tok defined as:

assi_tok: TOK → Category × Type × N, where Category = the set of all categories of tokens in the system = {T, R, A}; Type = the set of all types of each category ∈ Category, i.e., Type = T ∪ R ∪ A; and N is the set of natural numbers. Let assi_tok(TOKi) = (categoryi,


typei, ni). The function assi_tok satisfies the following constraints:
∀TOKi: (categoryi = R) ⇒ (typei ∈ R) ∧ (1 ≤ ni ≤ inst_R(typei))
∀TOKi: (categoryi = A) ⇒ (typei ∈ A) ∧ (1 ≤ ni ≤ inst_A(typei))
∀TOKi: (categoryi = T) ⇒ (typei ∈ T) ∧ (ni = 1)
assi_tok defines the category, type and number of instances of each token.

Fn is a function associated with each place and token. It is defined as:

Fn: P × TOK → ℘(TIME × TIME). For a token TOKk ∈ TOK, 1 ≤ k ≤ x, and a place Pl ∈ P, Fn(Pl, TOKk) = {(ai, aj)}, where ai is the entry time of TOKk into place Pl and aj is the exit time of TOKk from place Pl. For a token entering and exiting a place multiple times, |Fn(Pl, TOKk)| = the number of times TOKk entered the place Pl. The process by which the MAS architecture and MAS coordination are generated from the requirements is shown as a flowchart in Fig. 2.
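The structural part of this definition (places, transitions and arcs, without the colour tokens and timing function) can be sketched as follows in Java; the classes are illustrative only, and the firing rule shown is the standard token-moving one.

```java
import java.util.*;

// Compact sketch of the Task Petri Net elements defined above; only the
// structure is modelled here, not the full coordination semantics.
public class TaskPetriNetSketch {

    enum PlaceType { PT, PH, PC, PE, PF, PR, PA, PD }      // the eight place types
    enum TransitionType { TH, TC, TE, TF, TD }             // the five transition types

    static final class Place {
        final String id; final PlaceType type; int tokens;
        Place(String id, PlaceType type) { this.id = id; this.type = type; }
    }

    static final class Transition {
        final String id; final TransitionType type;
        final List<Place> inputs = new ArrayList<>();      // arcs in I
        final List<Place> outputs = new ArrayList<>();     // arcs in O
        Transition(String id, TransitionType type) { this.id = id; this.type = type; }
        boolean enabled() { return inputs.stream().allMatch(p -> p.tokens > 0); }
        void fire() {
            if (!enabled()) return;
            inputs.forEach(p -> p.tokens--);
            outputs.forEach(p -> p.tokens++);
        }
    }

    public static void main(String[] args) {
        Place pt = new Place("Pt", PlaceType.PT); pt.tokens = 1;   // task waiting
        Place ph = new Place("Ph", PlaceType.PH);                  // task ready to run
        Transition th = new Transition("th", TransitionType.TH);   // dependency check
        th.inputs.add(pt); th.outputs.add(ph);
        th.fire();
        System.out.println("tokens in Ph: " + ph.tokens);          // 1
    }
}
```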

Figure 2. Flowchart of the proposed methodology


V. CASE STUDY

Let us start with the case study by applying our proposed methodology. We take Library system as our case study application. Fig. 3 shows the ontology of a Library System. The ontology consists of some concepts and relations. There is a TASK concept type in the ontology which describes the task that should be performed to achieve some goal. There are other concepts that collectively describe some sub-goal to be achieved in the library system. For e.g. the concept types Check, Validity, Member, collectively describe the subgoal Check the membership validity. There are two types of relations i) consists_of and ii) happened_before. Here we denote happened_before relationship by H-B. The consists_of relation exists between some set of concepts describing a sub- goal and a set of instances of TASK concept. The happened_before relation exists between two instances of TASK concept. In the figure, the consists_of relation has incoming arcs from the concepts types Check, Validity, Member and outgoing arcs to TASK concepts Get library identity card of member, Check for validity of that ID card. It means that these two tasks have to be performed to achieve sub-goal described by concepts Check, Validity, Member i.e. tasks Get library identity card of member and Check for validity of that ID card have to be performed to

achieve sub-goal Check the membership validity. The happened_before relationship exists between these two tasks which means that task Check for validity of that ID card cannot start until task Get library identity card of member is completed. Now consider from user side requirements come as Delete account of member with member id <i> and book with book id <j> from database.It is the main goal. Step 1: We have to perform goal oriented Requirements Analysis of main requirements. It is an informal process performed by the Requirements Analysts and after Requirements Analysis, it is represented by Goal Graph shown in Fig. 4. Step 2: The leaf node sub goals are given as input to the automated system. By semantic mapping [16, 17] system maps each basic keyword of leaf sub goals to the goal concepts of ontology, and finds out set of tasks required to be performed to achieve those sub-goals. This is shown in Fig. 5. Step 3: The tasks that we get from step 2 are used to form task graph. Dependency between these tasks is known from ontology. The task graph is shown in Fig. 6 where task A implies Check that requirement of book id <j> < threshold, if yes then continue, else stop.

Figure 3. Ontology Diagram of Library System


Figure 4. Goal Graph representation of basic requirements

Step 4: Using Task Graph of Fig. 6, we find out the number of agents and their capability set following the methodology shown as a flowchart in Fig. 2. The maximum number of concurrent agents at any level is 2, so we create 2 agents- A1 and A2. Let C be assigned to A1s capability set and D to A2s capability set, <A1, {C}>, <A2, {D}>. Both C and D have single predecessor, B. So, B is added to the capability set of either A1 or A2. Let it be added to the capability set of A1. So, we have <A1, {B, C}>. Now, B has a single predecessor, A. So, A is added to the capability set of A1. So, we have <A1, {A, B, C}>. There are no other predecessors at level higher than A. D has a single successor, E. So, E is added to the capability set of A2. So, we have <A2, {D, E}>. The total number of agents deployed is 2 and the MAS architecture is <A1, {A, B, C}>, <A2, {D, E}>. Step 5: Using the Task Graph of Fig. 6 and MAS architecture developed in step 4, MAS coordination is formed i.e. to satisfy user requirements, how a set of required agents (A1, A2) will perform a set of required tasks (A, B, C, D, E) collaboratively can be represented by Task Petri Nets shown in Fig. 7.
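A simplified reading of the Step 4 heuristic is sketched below in Java: the widest level of the task graph fixes the number of agents, and the remaining tasks inherit the agent of an already assigned successor or, failing that, of their single predecessor. This is an illustration of the worked example only, not the full flowchart of Fig. 2.

```java
import java.util.*;

// Simplified sketch of deriving agent capability sets from the task graph of
// Fig. 6 (A -> B, B -> C, B -> D, D -> E); names and seeding are illustrative.
public class AgentDerivation {

    public static void main(String[] args) {
        Map<String, List<String>> succ = new LinkedHashMap<>();
        succ.put("A", Arrays.asList("B"));
        succ.put("B", Arrays.asList("C", "D"));
        succ.put("C", Collections.emptyList());
        succ.put("D", Arrays.asList("E"));
        succ.put("E", Collections.emptyList());

        Map<String, String> pred = new HashMap<>();
        succ.forEach((t, list) -> list.forEach(s -> pred.put(s, t)));

        // seed: the widest level {C, D} fixes the number of agents at two
        Map<String, String> agentOf = new LinkedHashMap<>();
        agentOf.put("C", "A1");
        agentOf.put("D", "A2");

        boolean changed = true;
        while (changed) {
            changed = false;
            for (String task : succ.keySet()) {
                if (agentOf.containsKey(task)) continue;
                String inherited = null;
                for (String s : succ.get(task))                    // from an assigned successor
                    if (agentOf.containsKey(s)) { inherited = agentOf.get(s); break; }
                if (inherited == null && pred.containsKey(task)    // or from the single predecessor
                        && agentOf.containsKey(pred.get(task)))
                    inherited = agentOf.get(pred.get(task));
                if (inherited != null) { agentOf.put(task, inherited); changed = true; }
            }
        }
        System.out.println(agentOf);   // {C=A1, D=A2, B=A1, E=A2, A=A1}
    }
}
```

Running the sketch reproduces the architecture of the worked example, <A1, {A, B, C}> and <A2, {D, E}>.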

Figure 5. Procedure for Semantic Mapping

Task B implies Check both book & member database whether member id <i> has not returned book, and any book <j> is not returned by any member. i. e in task B two checking operations are there. Task C implies Delete member id <i> account. Task D implies Remove book id <j> from library Task E implies Delete entry of book id <j> from database

Figure 7. Task Petri Nets representation of MAS coordination

VI. CONCLUSION

Figure 6. Task Graph for the set of tasks found from Semantic Mapping

In this paper, we have developed an automated system to generate MAS architecture and coordination from the user requirements and domain knowledge. It is a formal methodology which is developer independent i.e. it produces same MAS architecture and coordination for the same set of requirements and domain knowledge. The future work is to include a verification module to check whether the developed architecture satisfies the requirements. The module can work at two levels, firstly after Requirements Analysis, it can check whether the analysis satisfies main requirements, and secondly, it can verify whether the MAS coordination satisfies main requirements.


REFERENCES
[1] G. Weiss, Ed., Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, MIT Press, 1999.
[2] M. J. Wooldridge, Introduction to Multiagent Systems, John Wiley & Sons, Inc., 2001.
[3] N. Jennings, On agent-based software engineering, Artificial Intelligence, vol. 117, pp. 277-296, 2000.
[4] J. Lind, Issues in agent-oriented software engineering, in P. Ciancarini and M. Wooldridge (eds.), Agent-Oriented Software Engineering: First International Workshop, AOSE 2000, Lecture Notes in Artificial Intelligence, vol. 1957, Springer-Verlag, Berlin.
[5] M. Wooldridge and P. Ciancarini, Agent-oriented software engineering: the state of the art, in P. Ciancarini and M. Wooldridge (eds.), Agent-Oriented Software Engineering: First International Workshop, AOSE 2000, Lecture Notes in Artificial Intelligence, vol. 1957, Springer-Verlag, Berlin Heidelberg, 2001, pp. 1-28.
[6] C. Green, D. Luckham, R. Balzer, et al., Report on a knowledge-based software assistant, in C. Rich and R. C. Waters (eds.), Readings in Artificial Intelligence and Software Engineering, Morgan Kaufmann, San Mateo, California, 1986, pp. 377-428.
[7] T. C. Hartrum and R. Graham, The AFIT wide spectrum object modeling environment: an AWESOME beginning, in Procs. of the National Aerospace and Electronics Conference, IEEE, 2000, pp. 35-42.
[8] R. Balzer, T. E. Cheatham, Jr., and C. Green, Software technology in the 1990s: using a new paradigm, Computer, pp. 39-45, Nov. 1983.
[9] L. Liu and E. Yu, From requirements to architectural design using goals and scenarios.
[10] Z. Shen, C. Miayo, R. Gay, and D. Li, Goal oriented methodology for agent system development, IEICE Trans. Inf. & Syst., vol. E89-D, no. 4, April 2006.
[11] S. Zhiqi, Goal Oriented Modelling for Intelligent Agents and their Applications, Ph.D. Thesis, Nanyang Technological University, Singapore, 2003.
[12] C. H. Sparkman, S. A. DeLoach, and A. L. Self, Automated derivation of complex agent architectures from analysis specifications, in Procs. of the Second International Workshop on Agent-Oriented Software Engineering (AOSE-2001), Montreal, Canada, May 2001.
[13] P. Bresciani, A. Perini, P. Giorgini, F. Giunchiglia, and J. Mylopoulos, Tropos: an agent-oriented software development methodology, Autonomous Agents and Multi-Agent Systems, vol. 8, pp. 203-236, 2004.
[14] P. Giorgini, J. Mylopoulos, and R. Sebastiani, Goal-oriented requirements analysis and reasoning in the Tropos methodology, Engineering Applications of Artificial Intelligence, vol. 18, pp. 159-171, 2005.
[15] P. H. P. Nguyen and D. Corbett, A basic mathematical framework for conceptual graphs, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 2, 2005.
[16] H. Kaiya and M. Saeki, Using domain ontology as domain knowledge for requirements elicitation, in Procs. of the 14th IEEE International Requirements Engineering Conference (RE'06), 2006.
[17] M. Shibaoka, H. Kaiya, and M. Saeki, GOORE: goal-oriented and ontology driven requirements elicitation method, in J.-L. Hainaut et al. (Eds.), ER Workshops 2007, LNCS 4802, 2007, pp. 225-234.


Agent Based Computing Environment for Accessing Privileged Services


Navin Agarwal
Dept. of Information Technology National Institute of Technology, Durgapur navin0706@gmail.com

Animesh Dutta
Dept. of Information Technology National Institute of Technology, Durgapur animeshnit@gmail.com The agent operates on the percepts in some fashion and generates actions that could affect the environment. This general flow of activities, i.e., sensing the environment, processing the sensed data/ information and generating actions that can affect the environment, characterizes the general behavior of all agents. B. MAS(Multi Agent System) Multi-agent systems (MASs) [2, 5] are computational systems in which two or more agents interact or work together to perform a set of tasks or to achieve some common goals [5-8]. Agents of a multi-agent system (MAS) need to interact with others toward their common objective or individual benefits of themselves. A multi-agent system can be studied as a computer system that is concurrent, asynchronous, stochastic and distributed. A multi agent system permits to coordinate the behavior of agents, interacting and communicating in an environment, to perform some tasks or to solve some problems. It allows the decomposition of complex task in simple sub-tasks which facilitates its development, testing and updating. The client agent outside the network which is subscribed to certain services need to interact with some agent residing inside the network, which will do the work on behalf of the user and send him back the result. In this paper we propose the whole architecture of the system, and how different agents will interact with each other. To develop the MAS, we will use JADE which is a software framework fully implemented in Java language. It simplifies the implementation of multi-agent systems through a middle-ware that claims to comply with the FIPA specifications and through a set of tools that supports the debugging and deployment phase. The agent platform can be distributed across machines (which not even need to share the same OS) and the configuration can be controlled via a remote GUI. The configuration can even be changed at runtime by creating new agents and moving agents from one machine to another one, as and when required. The only system requirement is the Java Run Time version 5 or later. The communication architecture offers flexible and efficient messaging, where JADE creates and manages a queue of incoming ACL messages, private to each agent. Agents can access their queue via a combination of several

Abstract In this paper we propose an application for accessing privileged services on the web, which is deployed on JADE (Java Agent Development Framework) platform. There are many Organizations/ Institutes which have subscribed to certain services inside their network, and these will not be accessible to people who are a part of the Organization/Institute when they are outside their network (for example in his residence). Therefore we have developed two software agents; the person will request the Client Agent (which will be residing outside the privileged network) for accessing the privileged services. The Client Agent will interact with the Server Agent (which will be residing inside the network which is subscribed to privileged services), which will process the request, and send the desired result back to the Client Agent.

I. INTRODUCTION Many Organizations/Institutes have subscription to certain services inside their network, for example here at NIT Durgapur there is subscription of IEEE and ACM. When outside the network, these services cannot be accessed. We plan to address this problem and also automate the whole process so that so that human effort is reduced. To solve the problem we will build an agent based system, where multiple agents will interact with each other to solve the problem. When we talk about multiple agents interacting, the system becomes a Multi-Agent system, descriptions of which are given below. A. Agent An agent is a computer system or software that can act autonomously in any environment. Agent autonomy relates to an agents ability to make its own decisions about what activities to do, when to do, what type of information should be communicated and to whom, and how to assimilate the information received. An agent in the system is considered a locus of problem-solving activity; it operates asynchronously with respect to other agents. Thus, an intelligent agent inhabits an environment and is capable of conducting autonomous actions in order to satisfy its design objective [15]. Generally speaking, the environment is the aggregate of surrounding things, conditions, or influences with which the agent is interacting. Data/information is sensed by the agent. This data/information is typically called percepts.


modes: blocking, polling, timeout and pattern matching based. The full FIPA communication model has been implemented and its components have been clearly distinct and fully integrated: interaction protocols, envelope, ACL, content languages, encoding schemes, ontologies and, finally, transport protocols. The transport mechanism, in particular, is like a chameleon because it adapts to each situation, by transparently choosing the best available protocol. Most of the interaction protocols defined by FIPA are already available and can be instantiated after defining the application-dependent behavior of each state of the protocol. SL and agent management ontology have been implemented already, as well as the support for user-defined content languages and ontologies that can be implemented, registered with agents, and automatically used by the framework. II. RELATED WORK Agent-based models have been used since the mid-1990s to solve a variety of business and technology problems. Examples of applications include supply chain optimization [9] and logistics [10], distributed computing [11], workforce management [12], and portfolio management [13]. They have also been used to analyze traffic congestion [14]. In these and other applications, the system of interest is simulated by capturing the behavior of individual agents and their interconnections. In this paper [15] a framework for constructing application in mobile computing environment has been proposed. In this framework an application is partitioned into two pieces, one runs on a mobile computer and another runs on a stationary computer. They are constructed by composing small objects, in which the stationary computer does the task for the mobile computer. This system is based on the service proxy, and is not autonomous. In our work we are building our system based on agents which adds a lot of flexibility and is autonomous. A Multi-Agent system [16] for accessing remote energy meters from electricity board is related to this work. In this the server said to be the host is located in the electricity board, and all the customers are the clients connected with the server. This MAS system helps in automating the task and thus replacing the human agents. It is similar to our scenario where we are automating the task of downloading papers from IEEE/ACM sites, and replacing human agents which can do the task being inside the privileged network. In this paper [17] architecture has been proposed for secure and simplified access to home appliances using Iris recognition, adding an additional layer of security and preventing unauthorized access to the home appliances. This model is also based on server and client approach, where the server will reside inside the home, and client will reside outside the home and send request to the server for performing task on behalf of the user. Advanced method for downloading webpages from the internet has been proposed here [18], we will be using many concepts from these to improve the working of the server agent and more utilization of bandwidth.

III. PROBLEM OVERVIEW

There are many networks which have privilege accessibility to many sites and servers. For example being inside NIT Durgapur, there is no authentication required for downloading papers and other documents from IEEE and ACM sites. A user who has the right to access that privileged network, but if he is outside that network he will not be able to. There can be scenarios in which an Institute or Organization, can pay for some services to be accessed inside their network. In such situations the user has to be inside the network to enjoy those services or, they can access the network from outside by means of a Proxy Server (There are some more possibilities). IV. SCOPE OF WORK The aim of this work is to automate this whole process. Make the work of the user easy and take advantage of the services or privilege that he is entitled to access being a part of the Institute or Organization. Developing the agent in JADE allows us to implement it for Mobile devices also. The only requirement for running any JADE agent is, Java Run Time Environment, which most of the System and Mobile Devices have. In this project, the user will just need to send the keyword, and all the related documents matching that keyword will be downloaded and sent to the user. The user need not wait for the whole process to finish. He just needs to send the request, and the Multi-Agent System will perform the task for the User. The main purpose of technology is to ease human work, so that the effort can be put to do more useful work. This project targets that specific purpose, with some added benefits to the user. V. MULTI-AGENT BASED ARCHITECTURE The agent system is divided in two parts: There will be one single agent called Server Agent which serves the requests of multiple users. Multiple Client Agents which will send a request to the Server Agent in form of a Keyword.

A.

Server Agent This Agent will run autonomously inside the network, which has privileged access, or has been authorized to a service. It will always be ready to accept request from client. Then that keyword will be searched in a Search Engine, and the source code of that web page will be downloaded using Java. That webpage will contain many links, and also some documents. If a link is found while searching the source code of that webpage, then its source code will also be downloaded. This can be visualized in the form of a graph as shown in figure 1. Building this graph will help us not to search for same links again. While parsing the source code of the webpage whenever a link is found, the java code for downloading the source will be called again and executed in


a different thread. Whenever any document is found then the java code for downloading files from web will be called and executed in a different thread. This process may continue forever, so we will restrict the depth of the graph from the starting page. Whenever we are parsing the source code which is at a maximum depth from the starting page then only documents in that web page will be downloaded.
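A much-simplified, single-threaded sketch of this traversal is given below; the implementation described above spawns a separate thread per link and per document, whereas here links are followed recursively up to a depth limit, already visited pages are skipped, and candidate documents are only collected. The URL and depth in the usage example are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.*;
import java.util.regex.*;

// Much-simplified sketch of the depth-limited link traversal: pages already
// seen are skipped, links are followed only up to maxDepth, and document links
// (e.g. PDFs) are merely collected rather than downloaded in worker threads.
public class LinkGraphCrawler {

    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    private final Set<String> visited = new HashSet<>();
    private final List<String> documents = new ArrayList<>();

    void crawl(String pageUrl, int depth, int maxDepth) {
        if (depth > maxDepth || !visited.add(pageUrl)) return;   // skip repeated links
        String html = fetch(pageUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1);
            if (link.endsWith(".pdf")) documents.add(link);       // candidate document
            else crawl(link, depth + 1, maxDepth);
        }
    }

    String fetch(String pageUrl) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(pageUrl).openStream()))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
        } catch (Exception e) {
            // unreachable or malformed pages are simply ignored in this sketch
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        LinkGraphCrawler crawler = new LinkGraphCrawler();
        crawler.crawl("http://example.com/search?q=keyword", 0, 2);   // placeholder start page
        System.out.println(crawler.documents);
    }
}
```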

downloading has been done, and finally the server agent will send the zipped folder to the client.

Figure 2. Diagram showing the interaction between Server Agent, Client Agent and Java Codes in Server.

Figure 1. Diagram showing the graph of the links and documents.

B. Client Agent
This will be a very simple agent; it will perform two tasks:
Authentication of the user: it will send a request to the server with the credentials. If the user is authenticated, then the user will be able to perform his task.
Provide a simple GUI to the user for sending his keyword, relevant to which the user requires documents (or papers). The user can also directly send the link of the document or IEEE page; in this situation the keyword will not be searched in a search engine, but the server agent will perform the next step directly.

VI. PROTOTYPE DESIGN Figure 3 represents the main JADE prototype elements. An application based on JADE is made of a set of components called Agents each one having a unique name. Agents execute tasks and interact by exchanging messages. Agents live on top of a Platform that provides them with basic services such as message delivery. A platform is composed of one or more Containers. Containers can be executed on different hosts thus achieving a distributed platform. Each container can contain zero or more agents. A special container called Main Container exists in the platform. The main container is itself a container and can therefore contain agents, but differs from other containers as It must be the first container to start in the platform and all other containers register to it at bootstrap time. It includes two special agents: the AMS that represents the authority in the platform and is the only agent able to perform platform management actions such as starting and killing agents or shutting down the whole platform (normal agents can request such actions to the AMS). The DF that provides the Yellow Pages service where agents can publish the services they provide and find other agents providing the services they need.

In figure 2 interactions between Client Agent, Server Agent and Java Codes in Server has been shown. First step is the authentication process, in which client sends the credential to the server agent for verification. If the credentials are verified then the client will be granted access. The client then sends the search keyword to the server agent, which then verifies if the keyword is valid. If it is valid then, the server agent calls the Java code for downloading source code, which will search all the links starting from the mail search page, and process as shown in figure 3. The source code downloader will send the list of all the documents found back to the server agent. Then the agent will send this list to the Document downloader code, which will download all the documents, and save it in a zipped folder ready to be sent to the client. Then it will notify the server agent that the

Agents can communicate transparently regardless of whether they live in the same container, in different containers (in the same or in different hosts) belonging to the same platform or in different platforms (e.g. A and B). Communication is based on an asynchronous message passing paradigm. Message format is defined by the ACL language defined by FIPA [19], an international organization that issued a set of specifications for agent interoperability. An ACL Message contains a number of fields including


The sender The receiver(s). The communicative act (also called performative) that represents the intention of the sender of the message. For instance when an agent sends an INFORM message it wishes the receiver(s) to become aware about a fact (e.g. (INFORM "today it's raining")). When an agent sends a REQUEST message it wishes the receiver(s) to perform an action. FIPA defined 22 communicative acts, each one with a well defined semantics, that ACL gurus assert can cover more than 95% of all possible situations. Fortunately in 99% of the cases we don't need to care about the formal semantics behind Communicative acts and we just use them for their intuitive meaning. The content i.e. the actual information conveyed by the message (the fact the receiver should become aware of in case of an INFORM message, the action that the receiver is expected to perform in case of a REQUEST message).
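As an illustration of how such a message could be assembled with the JADE classes mentioned above, the sketch below has the client agent send its keyword as a REQUEST to the server agent of the deployment example (B@Platform2 at http://host3:7778/acc); the class and behaviour names are ours, and error handling is omitted.

```java
import jade.core.AID;
import jade.core.Agent;
import jade.core.behaviours.OneShotBehaviour;
import jade.lang.acl.ACLMessage;

// Minimal sketch of the client agent sending its search keyword to the server
// agent as a FIPA REQUEST and waiting for the reply.
public class ClientAgentSketch extends Agent {

    @Override
    protected void setup() {
        addBehaviour(new OneShotBehaviour() {
            @Override
            public void action() {
                ACLMessage request = new ACLMessage(ACLMessage.REQUEST);
                AID server = new AID("B@Platform2", AID.ISGUID);
                server.addAddresses("http://host3:7778/acc");
                request.addReceiver(server);
                request.setContent("multi agent systems");   // the search keyword
                send(request);

                ACLMessage reply = blockingReceive();          // wait for the server's answer
                if (reply != null) {
                    System.out.println("Server replied: " + reply.getContent());
                }
            }
        });
    }
}
```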

In Figure 5, three agents have been shown two clients and one server. Every system that runs a JADE platform will have a main container where all the agents run. The two clients send a request to the server which contains the search query or the link for the paper/document to be downloaded. Host 3: Server Host 1: Client Host 2: Client Host 3 is the server, where the JADE platform will run, there is only one container called the Main Container where along with the server agent two more agents called AMS and DF will run. Name of the server agent is B@Platform2, and it address is http://host3:7778/acc. When the client agents will communicate with the server agent remotely then host3 must be fully qualified domain name. There are two clients Host 1 and Host 2, both this will have a JADE platform with one container called the Main Container where along with client agent there will be two more agents called AMS and DF running. When a user wants to send a request to the server agent, the client agent will send a message to the server agent, where the receiver address in this case will be http://host3:7778/acc and the name of the agent will be B@Platform2, along with other necessary details.

Figure 3. Diagram showing two Client Agents sending messages to the server agent.
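On the client side, addressing the remote server agent described above could be done roughly as follows. The agent name and MTP address come from the text; the content format and class name are invented for illustration and are not the paper's actual code.

import jade.core.AID;
import jade.core.Agent;
import jade.core.behaviours.OneShotBehaviour;
import jade.lang.acl.ACLMessage;

// Sketch of a client agent on Host 1 or Host 2 contacting the remote server agent.
public class ClientAgent extends Agent {
    protected void setup() {
        addBehaviour(new OneShotBehaviour(this) {
            public void action() {
                AID server = new AID("B@Platform2", AID.ISGUID);   // globally unique agent name
                server.addAddresses("http://host3:7778/acc");       // remote platform's MTP address
                ACLMessage req = new ACLMessage(ACLMessage.REQUEST);
                req.addReceiver(server);
                req.setContent("search: wireless sensor networks"); // hypothetical request format
                myAgent.send(req);
            }
        });
    }
}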

VII. CONCLUSION
In this work, we have developed an agent-based system for remotely accessing privileged services in a network. The service in this scenario is access to IEEE and ACM subscription sites, which do not require authentication when accessed from inside the network. We have also automated the process of downloading papers/documents from the web that match the search keyword. This application is a first implementation of its type, so there is considerable scope for improvement. We plan to improve the search and give better results by considering the semantics of the search keyword. This work addresses one such privileged service; the model can be used as a base and expanded to include many more such services, and to provide automation wherever possible.
REFERENCES
[1] C. A. Rouff, M. Hinchey, J. Rash, W. Truszkowski, and D. Gordon-Spears (Eds.), Agent Technology from a Formal Perspective, Springer-Verlag London Limited, 2006.
[2] G. Weiss (Ed.), Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, MIT Press, 1999.
[3] N. J. Nilsson, Artificial Intelligence: A New Synthesis, Morgan Kaufmann Publishers Inc., 1998.
[4] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Pearson Education, 2003.
[5] M. J. Wooldridge, Introduction to Multiagent Systems, John Wiley & Sons, Inc., 2001.
[6] A. Idani, B/UML: Setting in Relation of B Specification and UML Description for Help of External Validation of Formal Development in B, doctoral thesis, Grenoble University, November 2005.
[7] G. W. Brams, Petri Nets: Theory and Practice, Vol. 1-2, MASSON, Paris, 1982.
[8] M.-J. Yoo, A Componential Approach for Modeling of Cooperative Agents and Its Validation, doctoral thesis, Paris 6 University, 1999.
[9] J.-Y. Shiau and X. Li, "Modeling the supply chain based on multi-agent conflicts," in Proc. IEEE/INFORMS Int. Conf. on Service Operations, Logistics and Informatics (SOLI '09), 2009, pp. 394-399.
[10] Y. Wang, Y. Guo, and J. Zeng, "A study of logistics system model based on multi-agent," in Proc. IEEE Int. Conf. on Service Operations and Logistics, and Informatics (SOLI '06), 2006, pp. 829-832.
[11] R. Al-Khannak, B. Bitzer, and Hezron, "Grid computing by using multi agent system technology in distributed power generator," in Proc. 42nd Int. Universities Power Engineering Conf. (UPEC 2007), 2007, pp. 62-67.
[12] [Online]. Available: http://en.wikipedia.org/wiki/Workforce_management
[13] V. Krishna and V. Ramesh, "Portfolio management using cyberagents," in Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics, 11-14 Oct. 1998, vol. 5, pp. 4860-4865.
[14] Application of Agent Technology to Traffic Simulation, United States Department of Transportation, May 15, 2007.
[15] A. Hokimoto, K. Kurihara, and T. Nakajima, "An approach for constructing mobile applications using service proxies," in Proc. 16th Int. Conf. on Distributed Computing Systems, 27-30 May 1996, pp. 726-733.
[16] C. Suriyakala and P. E. Sankaranarayanan, "Smart multiagent architecture for congestion control to access remote energy meters," 13-15 Dec. 2007, vol. 4, pp. 24-28.
[17] A. Mondal, K. Roy, and P. Bhattacharya, "Secure and simplified access to home appliances using iris recognition," in Proc. IEEE Workshop on Computational Intelligence in Biometrics: Theory, Algorithms, and Applications (CIB 2009), 30 March-2 April 2009, pp. 22-29.
[18] A. Kundu, A. R. Pal, T. Sarkar, M. Banerjee, S. Mandal, R. Dattagupta, and D. Mukhopadhyay, "An alternate downloading methodology of webpages," in Proc. Seventh Mexican Int. Conf. on Artificial Intelligence (MICAI '08), 27-31 Oct. 2008, pp. 393-398.
[19] [Online]. Available: http://www.fipa.org


An Interactive Multi-touch Teaching Innovation for Preschool Mathematical Skills


Suparawadee Trongtortam
Technopreneurship and Innovation Management Program, Chulalongkorn University, Bangkok, Thailand
Suparawadee.t@student.chula.ac.th

Peraphon Sophatsathit and Achara Chandrachai
Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok, Thailand
peraphon.s@chula.ac.th / achandrachai@gmail.com

Abstract - The paper proposes a teaching medium that is suitable for preschool children and teachers to develop basic mathematical skills. The research applies the bases of Multi-touch and Multi-point media technologies to create an innovative, interactive teaching technique. By utilizing Multi-touch and the connectivity structure of Multi-point to create a technology that facilitates simultaneous interaction from child learners, the teacher can better adjust and adapt the lessons accordingly. The benefit of this innovation is the amalgamation of technology and new ideas to support teaching-media development that permits teachers and students to interact with each other directly, as well as to learn by themselves. Keywords - Multi-touch; Multi-point; preschool mathematical skills; interactive teaching technique.

I. INTRODUCTION

Preschool learning is the first step of education that supports child learners in all aspects, e.g., physical, intellectual, professional, and societal knowledge. One of the most urgent and important elements in building their learning is teaching media, owing to their significant role in disseminating knowledge, experience, and other skills to children. There are numerous teaching media for the preschool level, ranging from conventional paper-based media, transparencies, audio, and video to computer-based media. The latter is the principal teaching vehicle and has played an important role owing to its usefulness and convenience. Children can learn by themselves [1, 2] and be independent of the classroom environment. This research aims at using the connectivity of the Multi-point technique and the Multi-touch approach as the platform and underlying research process to develop proper stimulating media for preschool children to learn basic mathematics. The paper is organized as follows. Sections 2 and 3 briefly explain Multi-touch and Multi-point technologies. Section 4 describes the proposed approach, followed by the experiments in Section 5. The results are summarized in Section 6. Section 7 concludes with the benefits and some final thoughts.

II. MULTI-TOUCH

Multi-touch [3] is a technology that supports several inputs at the same time to create interaction between the user and the computer. The system responds to finger movements as commands issued by the user, e.g., select, scroll, zoom, or expand. Fig. 1 shows multiple fingers touching several areas of the screen simultaneously, thereby mimicking an interactive reality of learning that keeps the learner highly alert.

Figure 1. Multi-touch display and finger movement control
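The paper does not specify how the touch input is programmed. As a purely illustrative sketch in Java (JavaFX is one possible option, chosen here only to keep a single language across the examples), a handler that reports several simultaneous touch points, in the spirit of the multi-touch behavior described above, might look like this:

import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.input.TouchEvent;
import javafx.scene.input.TouchPoint;
import javafx.scene.layout.Pane;
import javafx.stage.Stage;

// Illustrative only: prints every active touch point, so several children
// pressing the screen at once are all registered at the same time.
public class MultiTouchDemo extends Application {
    @Override
    public void start(Stage stage) {
        Pane pane = new Pane();
        pane.setOnTouchPressed((TouchEvent e) -> {
            System.out.println("Active touch points: " + e.getTouchCount());
            for (TouchPoint tp : e.getTouchPoints()) {
                System.out.printf("  point %d at (%.0f, %.0f)%n",
                        tp.getId(), tp.getX(), tp.getY());
            }
        });
        stage.setScene(new Scene(pane, 640, 480));
        stage.show();
    }

    public static void main(String[] args) { launch(args); }
}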

III. MULTI-POINT

Multi-point [4, 5] is a multiple-computer connection structure built around Windows MultiPoint Server [6] for educational institutes and learning centers. It uses one host to support a multi-user interface, permitting simultaneous user responses. The underlying configuration differs from the conventional client-server (C-S) model in that a communication exchange in C-S takes place between a client and the server in a pair-wise manner; any exchange among clients is implicitly routed through the server. Multi-point, on the other hand, is a simulcast among peers, where everyone can see one another simultaneously and interactively. This is shown in Fig. 2. The result of such a connectivity scheme is lower expenditure, lower power consumption, and easier management, which is ideal for a classroom environment.


Figure 2. Connectivity of Multi-point scheme
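The simulcast idea can be illustrated with a deliberately simplified sketch that is not the paper's implementation: one teacher host keeps a socket per student device and pushes the same exercise to every device at once, instead of exchanging it pair-wise on demand. The port number, class size, and exercise text below are made up.

import java.io.IOException;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical teacher host that broadcasts one exercise to all connected students.
public class TeacherHost {
    private final List<Socket> students = new ArrayList<>();

    public void acceptStudents(int port, int expected) throws IOException {
        try (ServerSocket server = new ServerSocket(port)) {
            while (students.size() < expected) {
                students.add(server.accept());   // each student device connects once
            }
        }
    }

    // Simulcast: every connected student receives the exercise at the same time.
    public void broadcastExercise(String exercise) throws IOException {
        for (Socket s : students) {
            PrintWriter out = new PrintWriter(s.getOutputStream(), true);
            out.println(exercise);
        }
    }

    public static void main(String[] args) throws IOException {
        TeacherHost host = new TeacherHost();
        host.acceptStudents(5000, 2);                        // port and class size are assumptions
        host.broadcastExercise("Count the apples: 3 + 2 = ?");
    }
}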


This research applies both technologies by connecting several teaching aids with the help of interactive teaching media. The media in turn facilitate simultaneous teacher involvement and children's interaction. Teachers can teach and observe the students, while the students can react to the lesson promptly. Thus, lessons and practical exercises can be explained, worked out, and corrected on the spot. As such, the teacher can design the lesson and accompanying exercises in a manner that is unobtrusive and unbounded by physical means. Conventional preschool teaching employs Computer Assisted Instruction (CAI) [7], which provides media such as sentences, images, graphics, charts, graphs, videos, movies, and audio to present contents, lessons, and exercises in the form of ordinary classroom learning. Teaching by CAI can only create interaction between the learner and the computer. The proposed approach, on the other hand, instigates and collects responses from several children. The children collectively learn, collaborate, express individual opinions, and react as they proceed. This in turn stimulates their interest and thought processes for better understanding and knowledge acquisition.

Figure 3. Device connection

Figure 4. Flow of interactive media teaching

Fig. 3 shows the inter-connection of electronic devices for basic preschool mathematics, which consists of a Web server controlled by the teacher to observe each child's learning. The exercises are designed and broadcast via duplex wireless means that allow the students and the teacher to interact back and forth collectively at the same time.

IV. INTERACTIVE MEDIA TEACHING

Numerous educational media for creating learning lessons are prevalent in this digital age. CAI is perhaps the predominant technique adopted at all levels of teaching. Unfortunately, the state of the practice falls short of conveying effective teaching that inspires learning toward knowledge. The limitations of CAI technology preclude the teacher and students from interacting with one another simultaneously; thereby, spontaneous thinking and feedback can never be motivated and learned systematically. We shall explore the principal functionality of an interactive teaching innovation.

Fig. 4 illustrates the flow of the media set-up for interactive teaching. We exploit the Multi-point principle to attain greater children's interaction through the latest electronic devices and Multi-point technology. By strategically creating exercises in the form of interactive games that sense the use of multiple fingers touching and the children's thought processes, while stimulating their interest through game playing, the teacher can observe the children's behavior from her own screen for faster and easier access and response to the development of each child. Thus, she can promptly monitor, instruct, or sharpen the skill of an individual child or the whole group, without having to repeatedly recite the same instruction to every child as in a conventional classroom setting. Some of the benefits precipitated by the Multi-point principle are:
1. Instant interaction between children and teacher through easily understood media of instruction.
2. Flexibility in creating or enhancing teaching media to motivate children's interest, thereby lessening learning boredom.
3. Strengthened early-childhood skills with the help of drawing and graphical illustrations.
4. Increased speed of cognitive learning in children, which facilitates subsequent skill-development evaluation.
We will elaborate how the proposed scheme works out in the sections that follow.

A. Teacher Preparation Configuration
Instructional aids are prepared via our tool, which permits customized display formats through simple set-up configurations. The teacher can prepare her lessons and companion exercises off-line and upload or post them to the system database. The children have access to all the materials once they are uploaded or posted. Any un-posted instructions, lessons, and exercises are not accessible from a learner's display device. The process flow is depicted in Fig. 5; a code sketch of this posting rule is given after the next subsection.
B. Student Learning Process
The process begins with the student signing in to identify himself. He then selects the lesson or exercise set to work on. All activities are monitored from the teacher's console, where the results are made available instantly. The process flow is depicted in Fig. 6.
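As a hypothetical illustration of the posting rule in subsection A above (class and method names are invented; the paper does not describe its tool or database at code level), only lessons the teacher has explicitly posted become visible to the learners' devices:

import java.util.ArrayList;
import java.util.List;

// Invented sketch of the posting rule: un-posted material stays invisible to students.
class Lesson {
    final String topic;
    boolean posted = false;
    Lesson(String topic) { this.topic = topic; }
}

class LessonRepository {
    private final List<Lesson> lessons = new ArrayList<>();

    void upload(Lesson lesson) { lessons.add(lesson); }   // teacher prepares off-line, then uploads
    void post(Lesson lesson)   { lesson.posted = true; }  // teacher releases it to the class

    // What a student's device is allowed to list.
    List<Lesson> visibleToStudents() {
        List<Lesson> visible = new ArrayList<>();
        for (Lesson l : lessons) if (l.posted) visible.add(l);
        return visible;
    }
}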

The evaluation adopts the three basic indicators given in Table I, namely Knowledge, Comprehension, and Application, to measure the effectiveness of the proposed interactive teaching innovation. This is accomplished in an actual preschool class setting by means of the CIPP model, described in the next section.
TABLE I. THREE LEVELS OF EVALUATION BY BLOOM'S TAXONOMY

Level            Evaluate
Knowledge        Able to tell the meaning of positive or negative signs, matching, and shapes
Comprehension    Knows how to complete arithmetic operations
Application      Does exercises by themselves

Figure 5. Preparation process of the teacher

Figure 6. Flow of student learning process

Fig. 5 illustrates the teacher preparation process, which proceeds as follows:
1. Select a topic and prepare the lesson.
2. Add or modify exercises if they were already prepared in an earlier session.
3. Upload/post the materials to the database.
In the meantime, the teacher can monitor the students' behavior during the lesson as follows:
1. Select the child to be monitored from the list.
2. Observe the child's work.
3. Assess the results to analyze the child's behavior and development.
C. Skill Test by Bloom's Taxonomy
Learning evaluation is carried out based on Bloom's Taxonomy [8] in the following aspects:
- Media skills test
- Subject comprehension from doing exercises
- Self practice

D. Learning Evaluation by the CIPP Model
This research makes use of the CIPP model [9] to evaluate class performance with respect to the following criteria: score, learning time, degree of satisfaction, and the ratio of learning per expense. The evaluation is performed in accordance with the CIPP capabilities as follows:
- Context: all required class materials from the course syllabus are divided into individual topics and subtopics. Each subtopic is further broken down into stories so that the subject contents can be presented. The corresponding companion exercises are either embedded or added at the end to furnish as many hands-on drills as possible.
- Input: the above multimedia lessons are measured to test and monitor the children's skill development, particularly multi-touch drills. The indirect benefits precipitated by this design are duration of work and satisfaction.
- Process: a number of evaluations are applied through the Multi-point and Multi-touch technologies, for example the time spent on exercise creation and modification, session evaluation, and cost ratio, as well as interactive monitoring, collaboration and assistance, instant display of results (upon their availability), and information transfer to and from the server. The savings so obtained are the utmost achievement of this innovative approach.
- Product: the instantaneous interaction between children and teacher, and the rate of self-learning upon score improvement, result in considerable skill improvement and experience with new technology. Thus, both scores and user satisfaction improve considerably.

V. EXPERIMENTAL RESULTS

The experiment was run on a Windows-based server that supports two iPad display devices (to be used by a preschool class). The proposed approach focused on a preschool mathematics class, where children learned basic arithmetic operations through interactive visual lessons and exercises. Students retrieved their lessons and corresponding exercises from the Multi-point teaching media system. As the learning progressed, they collaboratively worked on the lessons, exercises, and other activities via the multi-touch system. Their responses were recorded interactively (including


corrections, reworks, etc.). The results were instantly processed and made available in the teaching archive. The process is shown in Fig. 7.

Figure 7. Flow of preschool mathematics exercise

Fig. 8 shows the flow of lesson and exercise creation, modification, and monitoring of the students' activity interactively through the teaching media system. An individual student's screen can be selectively monitored, assisted to correct errors or when help is needed, and observed and reviewed via summaries of score, frequency of attempts, reworks, etc. All of this is supported by the Multi-point technique. Fig. 9 illustrates sample mathematics exercises.

Figure 8. Flow of design, modification, and monitoring of exercises
Figure 9. Sample math exercises

We conducted student performance and teacher productivity evaluations to measure the accomplishments of both parties under the proposed system in comparison with a conventional CAI system. The evaluations measured the two instructional media on the same and on different lessons. From the students' standpoint, the lessons were designed to observe how students would learn by drawing analogies from the same lesson and accumulate their skills from different lessons. From the teacher's standpoint, this gauged how productively the teacher performed on the same and on different lessons. Several measures were collected and categorized by student and teacher, namely exercise score (D), duration of work (E), and degree of satisfaction (F), as shown in Table II, and time spent on creating an exercise (M), time spent on one session evaluation (N), and ratio of learning per expense (P), as shown in Table III. For example, the exercise score obtained by students learning the same lesson using CAI is 5 out of 10, as opposed to 8 out of 10 problems via Multi-point. In learning different lessons, the exercise score drops to 1 out of 10 with CAI, but remains a decent 4 out of 10 problems with Multi-point. Similarly, Multi-point outperforms CAI by one hour in the time the teacher spends on creating an exercise in both cases. The same outcome holds for learning per expense, where more teachers agree on the effectiveness of the Multi-point approach than of the CAI approach. The corresponding plots are depicted in Figs. 10-13, respectively.

TABLE II. STUDENT PERFORMANCE EVALUATION

                              Same Lesson              Different Lessons
Detail                        CAI       Multi-Point    CAI       Multi-Point
D (exercise score)            5/10      8/10           1/10      4/10
E (duration of work)          20 min    13 min         45 min    30 min
F (degree of satisfaction)    9/15      12/15          5/15      9/15

TABLE III. TEACHER PRODUCTIVITY EVALUATION

                                       Same Lesson              Different Lessons
Detail                                 CAI       Multi-Point    CAI       Multi-Point
M (time spent creating an exercise)    4 hr      3 hr           4 hr      3 hr
N (time per session evaluation)        60 min    20 min         85 min    25 min
P (ratio of learning per expense)      7/15      13/15          4/15      9/15

Figure 10. Students' performance on the same lesson
Figure 11. Students' performance on different lessons
Figure 12. Teachers' performance on the same lesson
Figure 13. Teachers' performance on different lessons
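For concreteness, the relative improvements implied by Tables II and III can be computed directly. The small helper below is not part of the paper; it simply reproduces the arithmetic behind the comparisons quoted above.

// Reproduces the relative gains implied by Tables II and III (illustration only).
public class ComparisonDemo {
    static double percentGain(double cai, double multipoint) {
        return 100.0 * (multipoint - cai) / cai;
    }

    public static void main(String[] args) {
        // Same lesson, exercise score (D): CAI 5/10 vs. Multi-point 8/10
        System.out.printf("Score gain (same lesson): +%.0f%%%n", percentGain(5, 8));    // +60%
        // Same lesson, duration of work (E): 20 min vs. 13 min (lower is better)
        System.out.printf("Time saved (same lesson): %.0f%%%n", -percentGain(20, 13));  // 35%
        // Teacher preparation time (M): 4 hr vs. 3 hr in both lesson settings
        System.out.printf("Preparation time saved: %.0f%%%n", -percentGain(4, 3));      // 25%
    }
}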

From the overall comparative evaluation, it is apparent that the use of Multi-point and Multi-touch technologies is more effective than the conventional CAI approach from both the students' and the teachers'


standpoints. The obvious initial investment is fully offset by better scores, less time, and higher satisfaction on the students' part, and by higher productivity and cost effectiveness on the teachers' part. The percentage of agreeable opinions on electronic teaching media adoption is illustrated in Fig. 14.

Figure 14. Percentage of electronic teaching media adoption

VI. CONCLUSION

We have proposed an interactive teaching innovation for preschool children to improve their mathematical skills. The contributions are twofold: (1) the teacher can instruct and monitor preschool children's development in real time, promptly obtaining class evaluations, delivering lessons, and becoming more economically productive than with the conventional CAI approach; and (2) preschool children can improve their mathematical skills, or knowledge in general, by interactive means. They become more enthusiastic to explore new ideas, express themselves, and gain confidence and self-esteem as they progress. The proposed approach is simple and straightforward to realize. The underlying configuration exploits Multi-point to simultaneously connect students with the teacher, while interactively furnishing spontaneous communication among them. In the meantime, students can collaboratively work on the exercises to enhance their learning skills via Multi-touch technology. The resulting amalgamation is an innovative scheme that has subsequently been implemented as a teaching tool. We targeted the development of mathematical skills to gauge how the overall configuration would work out. The comparative summaries against conventional CAI turned out to be superior and satisfactory in many regards. We envision that the proposed system can be further extended to operate at a larger network scale, whereby a wider student audience can be reached.

ACKNOWLEDGMENTS
We would like to express our special appreciation to the teachers and students of Samsen Nok School and Phacharard Kindergarten School for their courteous cooperation and invaluable time for this research.

REFERENCES
[1] National Education Act, B.E. 2542, Rajchakichanubaksa, Vol. 116, Part 74a, 19 August 1999.
[2] Division of Academic and Education Standards, Office of the Elementary Education Commission, Handbook of Preschool Education, Age 3-5 Years, Ministry of Education, 2546 B.E.
[3] Wisit Wongvilai, Software Technology of the Future, [Online], 2008, http://www.nectec.or.th, [8 July 2010].
[4] Suphada Jaidee, Electronic Learning Media, [Online], 2007, http://www.microsoft.com/thailand/press/nov07/partners-in-learning.aspx, [12 July 2010].
[5] P. González Villanueva, R. Tesoriero, and J. A. Gallud, "Multi-pointer and collaborative system for mobile devices," in Proc. 12th Int. Conf. on Human Computer Interaction with Mobile Devices and Services, 2010, pp. 435-438.
[6] Windows MultiPoint Server 2011, [Online], http://www.microsoft.com/thailand/windows/multipoint/default.aspx, [12 August 2011].
[7] D. L. Kalmey and M. J. Niccolai, "A model for a CAI learning system," ACM SIGCSE Bulletin, Proc. 12th SIGCSE Symposium on Computer Science Education, vol. 13, no. 1, pp. 74-77, February 1981.
[8] B. S. Bloom et al., Taxonomy of Educational Objectives: The Classification of Educational Goals, Handbook I: Cognitive Domain, New York: David McKay, 1972.
[9] D. L. Stufflebeam, "The CIPP model for program evaluation," in G. F. Madaus, M. Scriven, and D. L. Stufflebeam (Eds.), Evaluation Models: Viewpoints on Human Services Evaluation, Boston: Kluwer-Nijhoff Publications, 1989.


AUTHOR INDEX
Agarwal, Navin  176
Aditya, Narayan Hati  116
Acharya, Sudipta  169
Bernard, Thibault  14
Bui, Alain  14
Chandrachai, Achara  181
Chaiwongsa, Punyaphat  145
Fung, Chun Che  24, 42
Chongstitwattana, Prabhas  133
Chen, Ting-Yu  30
Dutta, Animesh  169, 176, 138
Upadhyay, Prajna Devi  169, 138
Tran, Hung Dang  75
Smith, Derek H.  127
Hunt, Francis  127
Grachangpun, Rugpong  70
Getta, Janusz  121
Ghosh, Supriyo  138
Haruechaiyasak, Choochart  58, 70
Hiransakolwong, Nualsawat  81
Johannes, Fliege  98
Sil, Jaya  116
Jana, Nanda Dulal  116
Kamsiang, Nawarat  163
Kajornrit, Jesada  24, 42
Wong, Kok Wai  24
Keeratiwintakorn, Phongsak  19
Kongsakun, Kanokwan  42
Kubek, Mario  104
Leelawatcharamas, Tunyathorn  48
Le, Pham Thi Anh  157, 75
Li, Yuefeng  92
Lin, Yung-Chang  92
Minh, Quang Nguyen  75, 157
Muchalintamolee, Nuttida  151
Dewan, Mohammed  109
Quaddus, Mohammed  109
Mehta, Kinjal  87
Minh, Quang Nguyen  157, 75
Salani, Matteo  127
Mingkhwan, Anirach  64
Mandal, Sayantan  116
Meesad, Phayung  36
Montemanni, Roberto  127
Nitsuwat, Supot  54
Nakmaetee, Narisara  58



Nhan, Le Thanh  157
Ouedraogo, Boukary  14
Paoin, Wansa  54
Pattaranantakul, Montida  8
Sangwongngam, Paramin  8
Sangsongfa, Adisak  36
Senivongse, Twittie  48, 151, 163
Sheth, Ravi  87
Sodanil, Maleerat  58, 70
Sophatsathit, Peraphon  187
Sripimanwat, Keattisak  8
Tansriwong, Kitipong  19
Trongtortam, Suparawadee  181
Upadhyay, Prajna Devi  138, 169
Unger, Herwig  104
Waijanya, Sajjaporn  64
Wolfgang, Benn  98
Wu, Ming-Che  30
Wu, Sheng-Tang  92
Yampaka, Tongjai  133
Yawai, Wiyada  81
Zimniak, Marcin  98


The 9th International Conference on Computing and Information Technology

10-11 May 2013, at the Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, Thailand. www.ic2it.org

Faculty of Information Technology, King Mongkut's University of Technology North Bangkok. www.it.kmutnb.ac.th
