
Information Sciences 433–434 (2018) 1–16


Learning compact zero-order TSK fuzzy rule-based systems for high-dimensional problems using an Apriori + local search approach

Javier Cózar, Luis de la Ossa, José A. Gámez
S.I.M.D. Research Group, Department of Computer Systems, I3A, University of Castilla-La Mancha, Spain

Article history: Received 5 October 2016; Revised 13 October 2017; Accepted 23 December 2017; Available online 26 December 2017

Keywords: Fuzzy modelling; Evolutionary fuzzy systems; Machine learning; Local search

Abstract

Learning fuzzy rule-based systems entails searching for a set of fuzzy rules which fits the training data. Even when using fixed fuzzy partitions, the number of rules that can be formed is exponential in the number of variables. Thus, the search must be carried out by means of metaheuristics such as genetic algorithms, and is sometimes restricted to the set of candidate rules which reach a minimum support. In this article, we propose and evaluate two methods to learn zero-order Takagi–Sugeno–Kang fuzzy systems in medium-high dimensional domains. First, we introduce the minimum individual error of a rule in the criterion for candidate selection. Then, due to the intrinsic locality of fuzzy rule-based systems, where each rule mainly interacts with adjacent rules, we study the use of local search algorithms to carry out the search of the final rule base. Results show that the proposed scheme for candidate rule selection leads to an improvement in the results, regardless of the subsequent search algorithm. Moreover, local search-based algorithms achieve competitive results, reducing substantially the number of rules of the learnt systems.

© 2017 Elsevier Inc. All rights reserved.

1. Introduction

Fuzzy Rule-Based Systems (FRBSs) [27,28,35] are models based on fuzzy if-then rules. Initially, they gained importance in the field of control because of the use of linguistic variables [36], which made it easy for experts to elicit the models or some of their parts. However, nowadays they are frequently used for supervised learning (both classification and regression) [4,8,10,22,33]. The existing algorithms can learn both the fuzzy partitions and the fuzzy rules. However, in order to reduce the complexity of the learning process, or in some cases to improve the interpretability, it is very common to define the fuzzy partitions manually or by a one-step method [37], so that the learning algorithm only derives the base of fuzzy rules.
Rule derivation in FRBSs is often addressed as an optimization problem that consists of finding a set of rules and, in
the case of Takagi–Sugeno–Kang (TSK) fuzzy systems, establishing their consequents. As the number of rules that can be
defined is exponential in the number of variables, earlier algorithms such as [8] were only applicable to very small domains. The
main strategy adopted to overcome this problem consists in restricting the search to the subset of candidate rules above



a certain support, which can be generated with an adaptation of the Apriori algorithm [1], as proposed in [4]. However, the Apriori-based method still generates a large number of rules, which are in many cases redundant, and must be followed by a screening process which aims at preventing redundancy and uses individual quality metrics to select the most representative rules [4,15].
Once the set of candidate rules has been created, the learning process must find the subset of them that composes the final FRBS. This search problem has generally been addressed by population-based metaheuristics, such as Genetic Algorithms [19], as they produce good solutions in terms of error [4,11,32]. However, local search algorithms [23] can benefit from the intrinsic locality in FRBSs, as a fuzzy rule mainly interacts with those in its neighbourhood. Because of that, this type of algorithm has been successfully applied in this context [13,14,16], reducing the cost of the search as well. Specifically, the work in
[14] proposes two local search algorithms for finding zero-order TSK rules. In the first one, the neighbourhood operator
generates a new rule base by adding/deleting one rule to/from the original configuration. However, the evaluation implies
recalculating the consequents for the whole set of rules, which is done by means of Least Squares [29]. In contrast, the
neighbourhood operator used in the second algorithm generates a new rule base by changing only one rule, maintaining the rest of the consequents. By avoiding the Least Squares calculation, this algorithm aims at gaining speed.
The aforementioned local search algorithms did not consider any candidate selection process. As a consequence, they
could not deal with even medium-sized problems, and were only tested over datasets with a reduced number of variables.
In this work, we use the methodology for candidate rule generation based on the Apriori algorithm [1] that was proposed in [4] for classification problems, and later adapted in [15] for regression. This allows dealing with problems with a larger number of variables. First, we study the effect of including the standard deviation of the prediction error as a criterion for candidate selection in the screening process for all algorithms, also considering frequency-based partitioning as an alternative to the equal-width partitioning commonly used in similar studies. Afterwards, we study the performance of the proposed local search algorithms, and compare it with the state-of-the-art genetic algorithm proposed in [22], which also aims at learning the optimal fuzzy partitioning. We present an exhaustive experimentation using 17 medium-high dimensional datasets. Finally, we compare the best algorithms against PAES-RCS, a recent state-of-the-art Mamdani FRBS learning algorithm which has proved to build very accurate models. We discuss the balance between accuracy and interpretability and expose the benefits of using each type of system.
This article is divided into five sections besides this introduction. First, Section 2 introduces both zero-order TSK FRBSs
and the learning process in this kind of systems. Afterwards, Section 3 details the process of candidate rule generation,
and introduces the modified scheme for candidate rule selection, while Section 4 describes the proposed local search based
algorithms for rule derivation. Next, the experimental evaluation is carried out in Section 5. Finally, Section 6 summarizes
some general conclusions and describes some lines for future work.

2. Learning zero-order fuzzy rule-based systems from data: A general overview

2.1. Zero-order TSK FRBSs

In FRBSs, the antecedent of a rule consists of one or several predicates of the form "X is F", where X is a variable of the problem and F can be any fuzzy set defined over the domain of X [35]. In many cases, as is the case in this work, the domain of X is partitioned into a finite number of fuzzy sets, $\mathcal{A}$, and rules are defined with predicates "X is A", with $A \in \mathcal{A}$ [28]. Optionally, each fuzzy set A can be associated with a linguistic label so that the system is readable by human experts (e.g. Temperature is High). In such cases, X is referred to as a linguistic variable [36].
There are two main types of FRBSs: Mamdani and Takagi–Sugeno–Kang (TSK). They differ in both the representation of
the rules and the inference process. In Mamdani FRBSs [27,28], the consequents of the rules are also fuzzy sets. Therefore, a
rule, Rs , is defined as:

$R_s$: If $X_1$ is $A_1^s$ & $\ldots$ & $X_i$ is $A_i^s$ & $\ldots$ & $X_n$ is $A_n^s$ then $Y = A_y^s$,

where $X_i$ and $Y$ are domain variables, $A_i^s \in \mathcal{A}_i$, and $A_y^s \in \mathcal{A}_y$.

In Mamdani-type inference, an input example $e_l = (x_1^l, \ldots, x_n^l)$ fires each rule $R_s$ such that $\forall_{i=1\ldots n}\ \mu_{A_i^s}(x_i^l) > \epsilon$, where $\mu_{A_i^s}$ is the membership function corresponding to the fuzzy set $A_i^s$. When $R_s$ is fired by the example $e_l$, it generates as output a fuzzy set, $R_s^l$, defined over the domain of $Y$, as $\mu_{R_s^l}(y) = \min\left(\mu_{A_y^s}(y),\, h_s^l\right)$, where $h_s^l = T\left(\mu_{A_1^s}(x_1^l), \ldots, \mu_{A_n^s}(x_n^l)\right)$ is the matching degree of instance $e_l$ with the rule $R_s$, and $T$ is a t-norm (in this work, we consider the min as t-norm in all cases).
As the fuzzy sets corresponding to each variable usually overlap, an input example generally fires several rules, whose outputs must be aggregated into a single value. There are different approaches to do that. The FITA (First Integrate Then Aggregate) method [12] first defuzzifies (e.g. with the center of gravity) each output $R_s^l$ into its corresponding numerical value, $v_s^l$; then it aggregates them by using an average weighted by the matching degrees. Thus, given a rule base $\mathcal{RB}$, the


predicted output for the example $e_l$ is obtained as:

$$\hat{y}^l = \frac{\sum_{R_s \in \mathcal{RB}\,|\,h_s^l > \epsilon} h_s^l\, v_s^l}{\sum_{R_s \in \mathcal{RB}\,|\,h_s^l > \epsilon} h_s^l}, \qquad (1)$$

where $\epsilon$ ($\epsilon \geq 0$) determines the minimum matching degree which fires a rule.


In TSK FRBSs [31] the consequent of each rule $R_s$, $P_s(X_1, \ldots, X_n)$, is a polynomial function of the input variables. Therefore, a rule $R_s$ is defined as:

$R_s$: If $X_1$ is $A_1^s$ & $\ldots$ & $X_i$ is $A_i^s$ & $\ldots$ & $X_n$ is $A_n^s$ then $Y = P_s(X_1, \ldots, X_n)$.

The inference in TSK models consists in aggregating the individual (numerical) outputs produced by each rule $R_s$ when processing an example $e_l$, as an average weighted by the matching degree, $h_s^l$:

$$\hat{y}^l = \frac{\sum_{R_s \in \mathcal{RB}\,|\,h_s^l > \epsilon} h_s^l\, P_s(x_1^l, \ldots, x_n^l)}{\sum_{R_s \in \mathcal{RB}\,|\,h_s^l > \epsilon} h_s^l}. \qquad (2)$$

The order of a TSK FRBS refers to the degree of the polynomials that can be used in the consequents of the rules, which can be either a constant (zero) or a linear combination of the inputs (one). This article focuses on zero-order TSK systems, where the rules are expressed as:

$R_s$: If $X_1$ is $A_1^s$ & $\ldots$ & $X_i$ is $A_i^s$ & $\ldots$ & $X_n$ is $A_n^s$ then $Y = b_s$,

where $b_s \in \mathbb{R}$. Therefore, the output produced for an example $e_l$ when processed by a zero-order TSK system reduces to:

$$\hat{y}^l = \frac{\sum_{R_s \in \mathcal{RB}\,|\,h_s^l > \epsilon} h_s^l\, b_s}{\sum_{R_s \in \mathcal{RB}\,|\,h_s^l > \epsilon} h_s^l}. \qquad (3)$$
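To make the inference concrete, the following Python sketch (ours, not the authors' code) computes the zero-order TSK output of Eq. (3) for a single example using triangular fuzzy sets and the min t-norm; the rule representation and all names are hypothetical.

import numpy as np

def triangular(x, a, b, c):
    # Triangular membership function with feet a and c and peak b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def matching_degree(rule, example):
    # h_s^l: min t-norm over the memberships of the antecedent predicates.
    # rule["antecedent"] maps a variable index to an (a, b, c) triangle.
    return min(triangular(example[i], *abc) for i, abc in rule["antecedent"].items())

def tsk0_output(rule_base, example, eps=0.0):
    # Eq. (3): average of the consequents weighted by the matching degrees.
    h = np.array([matching_degree(r, example) for r in rule_base])
    b = np.array([r["consequent"] for r in rule_base])
    fired = h > eps
    if not fired.any():
        return float("nan")   # no rule fires this example
    return float(h[fired] @ b[fired] / h[fired].sum())

# Hypothetical two-rule system over a single input variable.
rules = [{"antecedent": {0: (-0.5, 0.0, 0.5)}, "consequent": 1.0},
         {"antecedent": {0: (0.0, 0.5, 1.0)}, "consequent": 3.0}]
print(tsk0_output(rules, [0.25]))   # both rules fire with degree 0.5 -> 2.0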

2.2. Learning TSK-0 FRBSs: a general overview

FRBS learning entails finding both the definition of the fuzzy partitions and a set of rules. It is generally addressed as a combinatorial optimization problem with heterogeneous configurations (a variable number of predicates in the antecedents) which codify both components of the model [4,11,22]. Each candidate solution represents an FRBS, and its fitness corresponds to its prediction error over the training data. In this work, we use the Mean Squared Error (MSE) to compute the fitness of the solutions:

$$MSE(E) = \frac{1}{2} \cdot \frac{\sum_{l=1}^{N} (\hat{y}^l - y^l)^2}{N},$$

where $N$ is the number of instances in $E$, $\hat{y}^l$ is the prediction for $e_l$, and $y^l$ is the real output value.
The learning process of FRBSs involves a large number of variables which are related among themselves: the performance
of a fuzzy rule depends on the data base definition, and vice versa, the goodness of a data base depends on the rule
base which is being used to evaluate the system. One of the most extended methodologies, called Genetic Fuzzy Systems
(GFSs) [9,11], consists in using genetic algorithms to carry out the learning process. The advantage of using GAs is two-fold:
1) it is a global search algorithm, which explores simultaneously changes in data base and rule base definitions; and 2)
multiobjective evolutionary algorithms (MOEAs) can be used to find solutions in the Pareto front, allowing the selection of solutions ranging from very accurate models to highly interpretable ones. Genetic algorithms have been widely applied in the literature, resulting in
successful learning algorithms for Mamdani [3,5] and TSK [22,30] FRBSs. A taxonomy of GFSs can be found in [18].
In many other cases, as is the approach followed in this work, the specification of the fuzzy partitions, $\mathcal{A} = \{\mathcal{A}_1, \ldots, \mathcal{A}_n\}$, is given or obtained in a straightforward way, either as a symmetrical partition (equal-width triangular fuzzy sets) or by a one-step method based on discretization plus creation of fuzzy sets [37]. In such a case, the learning algorithms
only search for the set of rules [4,8,10,11]. In the case of TSK models, the search involves assigning values to the consequents.
Although these parameters can be part of the configuration representing a FRBS [22], it is possible, given a set of rules, to
obtain the optimal combination of consequents, i.e., the one that reduces the training error, by means of the Least Squares
method [29].
Due to both the variable size of the rules, and the use of overlapping fuzzy sets, the number of rules that can be gen-
erated from a dataset is still intractable, and some methods restrict the search to a relatively small set of promising and
representative candidate rules, which are generated in a first stage, allowing subsequent stages of the algorithms to search
for the final rule base [4,15]. As great complexity (a large number of rules) can produce overfitting, the aim of the algorithms
is to find small sets of general rules. This is done by both limiting their size, and filtering by their support. Nevertheless,
even when considering this filtering, fuzzy rule learning algorithms scale badly compared with other machine learning algorithms (e.g. C4.5), and dealing with really high-dimensional datasets requires an additional step of variable selection, such as the one described in [6], which leaves just the most important features for the following learning stage.

3. Generation of candidate rules with Apriori + screening

The number of distinct rules that can be generated from a set of fuzzy partitions is exponential in the number of variables. Even when considering only those rules that match some instance in the training dataset, a single instance enables at least $2^n - 1$ rules.² This problem is critical, and it becomes necessary to pre-select a set of candidate rules which is tractable in
subsequent steps of the learning algorithms.
In favor of generalization, the learning process should generate a small set of simple and general rules. This implies
discarding those rules not matching a minimum proportion of training examples, as well as those which are redundant.
The method proposed in [4] adapts the well-known Apriori algorithm for frequent itemset detection [1] to extract candidate
rules with a minimum support, and then uses a screening process to select the most representative of them. Next, both
steps are described in detail.

3.1. Finding relevant candidate rules with Apriori

Let Es = {el ∈ E | hls > 0}, i.e., the set of examples in the training dataset which match the rule Rs . The support of Rs is
calculated as the sum of the matching degrees for each el ∈ Es :

support (Rs ) = μAsi (xli ) · . . . · μAsj (xlj ) .
el ∈Es

A rule is frequent if its support is equal to or greater than an established threshold, namely the minimum support. If a rule $R_s$ is not frequent, neither is any of the rules generated by adding predicates to $R_s$, which therefore do not need to be generated. This anti-monotone property of the support is the basis of the well-known Apriori algorithm for frequent itemset detection [1], which was adapted in [4] for finding frequent fuzzy rules. Algorithm 1 details a non-optimized version of this method.

Algorithm 1: Apriori algorithm for fuzzy rule generation.

Input:
  A: fuzzy partitions of the input variables
  E: training dataset
  min_support: minimum support
  max_size: maximum size (number of predicates) of the rules
Output:
  RB_ap: set of candidate rules produced

// Finds fuzzy predicates with min_support and rules with size 1
sz = 1;
FP = ∅;
CR_sz = ∅;
for A_i ∈ A do
  if support("X_i is A_i") ≥ min_support then
    FP = FP ∪ {"X_i is A_i"};
    CR_sz = CR_sz ∪ {[X_i is A_i]};
RB_ap = CR_sz;

// Iteratively builds the rules with support ≥ min_support
while sz < max_size and CR_sz ≠ ∅ do
  sz++; CR_sz = ∅;
  foreach R ∈ CR_{sz−1} do
    foreach "X_i is A_i" ∈ FP such that "X_i is ·" ∉ R do
      if support([R & "X_i is A_i"]) ≥ min_support then
        CR_sz = CR_sz ∪ {[R & "X_i is A_i"]};
  RB_ap = RB_ap ∪ CR_sz;

In a first step, the algorithm obtains the set of frequent predicates, FP, and generates the set of frequent rules of size one, namely CR_1. Then, in an iterative process, it generates the set of frequent rules of size sz, CR_sz, by extending the frequent rules of size sz − 1 with predicates in FP (if possible), and discarding those new rules that are not frequent.
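As an illustration of this level-wise procedure, the sketch below is a simplified, non-optimized rendering of Algorithm 1, not the authors' implementation; it assumes integer variable indices and a user-supplied membership(var, fset, value) function.

def support(rule, data, membership):
    # Support of a rule: sum over examples of the product of the memberships
    # of its antecedent predicates (rule = tuple of (variable, fuzzy-set id)).
    total = 0.0
    for example in data:
        degree = 1.0
        for var, fset in rule:
            degree *= membership(var, fset, example[var])
        total += degree
    return total

def apriori_candidates(partitions, data, membership, min_support, max_size):
    # partitions: dict mapping each (integer) variable index to its fuzzy-set ids.
    frequent_preds = [(v, f) for v, fsets in partitions.items() for f in fsets
                      if support(((v, f),), data, membership) >= min_support]
    level = [(p,) for p in frequent_preds]          # frequent rules of size 1
    candidates = list(level)
    size = 1
    while size < max_size and level:
        next_level = []
        for rule in level:
            max_var = max(var for var, _ in rule)
            for v, f in frequent_preds:
                if v <= max_var:                    # ordered expansion avoids duplicates
                    continue
                extended = rule + ((v, f),)
                if support(extended, data, membership) >= min_support:
                    next_level.append(extended)
        candidates.extend(next_level)
        level = next_level
        size += 1
    return candidates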

² In the case that one instance activates only one fuzzy set per feature, it enables $2^n - 1$ heterogeneous rules.

Fig. 1. Structure of the candidate rule generation tree.

The aforementioned process can be optimized by building a tree (Fig. 1) in which each node (excluding the root) represents a fuzzy predicate, and the path from the root to each node represents a rule. Therefore, a path cannot contain nodes involving the same variable, and the depth of each node indicates the number of antecedents of the rule it represents. When the rule corresponding to a node is not frequent, such node is removed (and therefore not expanded).

In order to reduce the complexity of the models, the depth of the tree can be limited, allowing only candidate rules with a maximum number of antecedents, which is determined by the parameter max_size.

3.2. Filtering candidate rules with screening

The size of the rule set returned by Apriori, $\mathcal{RB}_{ap}$, can still be high and unmanageable, even for small problems. Furthermore, many of such rules are usually redundant, and cover similar sets of examples. As a consequence, it is necessary to carry out a subsequent screening aimed at preserving only the most representative rules in $\mathcal{RB}_{ap}$.

The work in [4] describes a screening method for classification fuzzy rules, which was later adapted for regression with zero-order TSK rules [15]. Given a set of candidate rules, $\mathcal{RB}_c$, each example $e_l$ is assigned a weight that is inversely proportional to the number of rules in $\mathcal{RB}_c$ that it matches:

$$w(e_l, \mathcal{RB}_c) = \frac{1}{\left|\{R_s \in \mathcal{RB}_c \mid h_s^l > 0\}\right| + 1}.$$

The weighted relative support (wrSupport) of a rule $R_s$ given $\mathcal{RB}_c$ uses these weights. Therefore, it penalizes those rules matching examples which also match other rules in $\mathcal{RB}_c$, i.e., it penalizes redundant rules. It is expressed as:

$$\mathrm{wrSupport}(R_s, \mathcal{RB}_c) = \frac{\sum_{e_l \in E} w(e_l, \mathcal{RB}_c) \cdot h_s^l}{\sum_{e_l \in E} w(e_l, \mathcal{RB}_c)}.$$

Besides avoiding redundancy, it is convenient to filter candidate rules by their individual performance. In this respect, the work in [15] quantifies the performance of each rule as the minimum error that this rule could induce if used in isolation. In the case of a zero-order TSK rule, $R_s$, this corresponds to the standard deviation of the output values for the examples in $E^s$ (those matched by the rule), which we will denote as $SD(Y^s)$.

Considering both factors, and normalizing with $z_1$ and $z_2$, the individual performance of a fuzzy rule in the context of a set of rules, $\mathcal{RB}_c$, can be quantified as:

$$IP(R_s, \mathcal{RB}_c) = z_1\left(\mathrm{wrSupport}(R_s, \mathcal{RB}_c)\right) \cdot z_2\left(\frac{1}{SD(Y^s)}\right). \qquad (4)$$
The screening method starts by considering an empty set of candidate rules ($\mathcal{RB}_c = \emptyset$) and the whole training dataset $E$. At each iteration, the algorithm selects the fuzzy rule $R_s \in \mathcal{RB}_{ap}$ which maximizes $IP(R_s, \mathcal{RB}_c)$; incorporates it into $\mathcal{RB}_c$; and updates the weights of the examples, $w(e_l, \mathcal{RB}_c)\ \forall e_l \in E^s$. Afterwards, it discards those examples that match at least $k_t$ rules in $\mathcal{RB}_c$, so that they are no longer considered in the calculation of the weights. The process finishes when there is no example left, i.e., when all examples in $E$ have been covered by at least $k_t$ rules. Algorithm 2 shows the pseudocode of the method.
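The following Python sketch illustrates one possible rendering of this screening loop; it assumes a precomputed matrix of matching degrees and takes the normalization functions z1 and z2 as the identity, so it should be read as an approximation of Algorithm 2 rather than the authors' code.

import numpy as np

def screening(match, y, k_t):
    # match[s, l] = h_s^l, matching degree of candidate rule R_s with example e_l;
    # y: output values; k_t: coverage threshold before an example is discarded.
    n_rules, n_examples = match.shape
    selected = []
    alive = np.ones(n_examples, dtype=bool)
    while alive.any():
        # w(e_l) = 1 / (number of already selected rules matching e_l + 1).
        cover = (match[selected] > 0).sum(axis=0) if selected else np.zeros(n_examples)
        w = 1.0 / (cover + 1.0)
        best, best_ip = None, -np.inf
        for s in range(n_rules):
            if s in selected:
                continue
            covered = match[s] > 0
            if not covered.any():
                continue
            wr_support = (w[alive] * match[s, alive]).sum() / w[alive].sum()
            sd = max(y[covered].std(), 1e-12)      # SD(Y^s), guarded against zero
            ip = wr_support / sd                   # z1 and z2 taken as the identity
            if ip > best_ip:
                best, best_ip = s, ip
        if best is None:
            break                                  # no candidate rule left
        selected.append(best)
        # Discard examples already covered by at least k_t selected rules.
        coverage = (match[selected] > 0).sum(axis=0)
        alive &= coverage < k_t
    return selected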

4. Derivation of the rule base with local search algorithms

Given a set of candidate zero-order TSK rules, it is possible to obtain the values of the consequents that minimize the
training error with the Least Squares method [29]. However, including the whole set of candidate rules in the final rule base,
and setting the consequents to their optimal values, can produce overfitting. Furthermore, the number of rules, as well as

Algorithm 2: Screening algorithm.

Input:
  RB_ap: set of rules returned by Apriori
  E: training dataset
  k_t: maximum coverage of an instance before removal
Output:
  RB_c: set of rules returned

RB_c = ∅;
w(e_l, RB_c) = 1, ∀ e_l ∈ E;
while E ≠ ∅ do
  R_s = argmax_{R_s ∈ RB_ap} IP(R_s, RB_c);
  RB_ap = RB_ap − {R_s};
  RB_c = RB_c ∪ {R_s};
  E ← E − { e_l : |{R_t ∈ RB_c | h_t^l > 0}| ≥ k_t };
  update w(e_l, RB_c) ∀ e_l ∈ E^s;

the number of predicates in the antecedent of each rule, are factors that affect the interpretability of the models [21]. For
both reasons, rule derivation algorithms should discard as many candidate rules as possible, whilst preserving the accuracy.
As the performance of a FRBSs depends on the interaction of the whole rule base, the algorithms do not consider rules
individually. Instead, they focus on searching subsets of candidate rules [8]. This search can be stated as a combinatorial
optimization problem, where each configuration represents a rule base, and its fitness corresponds to the error of such rule
base over the training dataset. Population based metaheuristics, such as Genetic Algorithms [24], are frequently used to
learn FRBSs [9,11,22]. Nevertheless, the cost of the search with these methods can be too high, as they require the evaluation of a large number of configurations.
The results of [14,16] show that it is possible to improve the search process by using local search algorithms to determine
the final subset of candidate rules. Besides needing a significantly smaller number of evaluations, this approach allows
further improvements on some implementations. On the other hand, the starting point biases the result of local search
algorithms. This fact can be taken advantage of in order to reduce the number of rules of the final solution.
In this paper, we test our two proposed local search-based methods for zero-order TSK rule derivation. The first one, Local Search Least Squares, carries out the search in the space of subsets of rules, and evaluates each candidate subset by fixing the whole set of consequents with Least Squares. Although at each step only one rule changes (is added or deleted), this affects the entire system, and it becomes necessary to process the whole training dataset to evaluate each configuration. Because of that, and in order to improve the efficiency by taking advantage of locality, we propose another algorithm, Fast Local Search, where only one rule (one consequent) is recalculated at each step, allowing a fast evaluation of the neighbourhood.
Next, we describe the Least Squares method, as it is part of all the proposed algorithms. Afterwards, we describe each
one of the two local search algorithms in detail.

4.1. Least squares

The problem of fixing the consequents of a set of zero-order TSK rules so as to minimize the squared error over the training dataset can be stated as a linear regression problem, and solved by using the Least Squares method [29].
The output when processing an example $e_l$, given a rule base $\mathcal{RB}$, is computed as shown in Eq. (3). This operation can be vectorized as:

$$\hat{y}^l = \begin{bmatrix} c_1^l & \cdots & c_{|\mathcal{RB}|}^l \end{bmatrix} \begin{bmatrix} b_1 \\ \vdots \\ b_{|\mathcal{RB}|} \end{bmatrix},$$

where

$$c_s^l = \frac{h_s^l}{\sum_{R_t \in \mathcal{RB}} h_t^l}.$$

Using all the $N$ examples in the training dataset, the previous expression turns into:

$$Y = C \cdot B, \qquad \begin{bmatrix} y^1 \\ \vdots \\ y^N \end{bmatrix} = \begin{bmatrix} c_1^1 & \cdots & c_{|\mathcal{RB}|}^1 \\ \vdots & \ddots & \vdots \\ c_1^N & \cdots & c_{|\mathcal{RB}|}^N \end{bmatrix} \begin{bmatrix} b_1 \\ \vdots \\ b_{|\mathcal{RB}|} \end{bmatrix}.$$

The matrix of consequents $B$ could be obtained as $B = C^{-1}Y$. However, as $C$ is not invertible in general, we use the Moore–Penrose pseudoinverse, computing $B = (C^{\top}C)^{-1}C^{\top}Y$. The inverse of $C^{\top}C$ is obtained by applying the Singular Value Decomposition method [26].
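A minimal NumPy sketch of this step is given below; np.linalg.pinv computes the pseudoinverse through SVD, in line with the procedure described above, but the data and variable names are illustrative only.

import numpy as np

def consequents_least_squares(match, y):
    # match[l, s] = h_s^l for example e_l and rule R_s; y = target outputs.
    # C[l, s] = h_s^l / sum_t h_t^l (normalized firing strengths, Eq. (3) in matrix form).
    row_sums = match.sum(axis=1, keepdims=True)
    C = np.divide(match, row_sums, out=np.zeros_like(match, dtype=float), where=row_sums > 0)
    # B = (C^T C)^+ C^T Y, computed through the SVD-based pseudoinverse.
    return np.linalg.pinv(C) @ y

# Hypothetical toy data: 4 examples, 2 rules.
match = np.array([[1.0, 0.0], [0.7, 0.3], [0.3, 0.7], [0.0, 1.0]])
y = np.array([1.0, 1.6, 2.4, 3.0])
print(consequents_least_squares(match, y))   # consequents close to [1, 3]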

4.2. Local search least squares (LSLS)

This algorithm uses a basic local search procedure (hill climbing) to learn the rule base. As in population-based algorithms, each configuration represents a subset of candidate rules (for example, a binary string of length $|\mathcal{RB}_c|$), and is evaluated by calculating its consequents with Least Squares and measuring the mean squared error over $E$.

Given a candidate configuration $\mathcal{RB}_c$, its neighbourhood $N(\mathcal{RB}_c)$ is defined as the set of rule bases obtained by adding or deleting exactly one rule to/from $\mathcal{RB}_c$; that is, $\mathcal{RB}'_c \in N(\mathcal{RB}_c)$ if $\mathcal{RB}_c$ and $\mathcal{RB}'_c$ differ in exactly one rule. The pseudocode of Local Search Least Squares is described in Algorithm 3. The method takes an initial solution as starting point and, in each iteration, moves to

Algorithm 3: Local Search Least Squares algorithm.

Input: RB_i, E: initial rule base and training dataset
Output: RB_o: resulting rule base

RB_o = RB_i;
Error_o = MSE(RB_o, E);
repeat
  RB_c ← RB_o;
  foreach RB'_c ∈ N(RB_c) do
    LeastSquares(RB'_c, E);
    Error_c = MSE(RB'_c, E);
    if Error_c < Error_o then
      RB_o ← RB'_c;
      Error_o ← Error_c;
until Error_o does not improve;
return RB_o;

the best neighbour (in terms of MSE). The process stops when there is no improvement.
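The following Python sketch (ours, under the assumption that rule bases are represented as sets of candidate-rule indices) mirrors the add/delete hill climbing of Algorithm 3, reusing the consequents_least_squares helper sketched in Section 4.1.

import numpy as np

def mse(match, y, rules):
    # Evaluate a rule subset: fit its consequents with least squares, then measure
    # the mean squared error; examples fired by no rule are predicted as mean(y).
    sub = match[:, sorted(rules)]
    b = consequents_least_squares(sub, y)          # helper from the Section 4.1 sketch
    row = sub.sum(axis=1)
    pred = np.divide(sub @ b, row, out=np.full(len(y), y.mean()), where=row > 0)
    return float(((pred - y) ** 2).mean())

def local_search_least_squares(match, y, start=frozenset()):
    # Hill climbing over subsets of candidate rules (Algorithm 3): at each step,
    # move to the best neighbour obtained by adding or deleting exactly one rule.
    current = set(start)
    best_err = mse(match, y, current) if current else np.inf
    while True:
        best_move, move_err = None, best_err
        for s in range(match.shape[1]):
            neighbour = current ^ {s}              # toggle rule s
            if not neighbour:
                continue
            err = mse(match, y, neighbour)
            if err < move_err:
                best_move, move_err = neighbour, err
        if best_move is None:
            return current, best_err
        current, best_err = best_move, move_err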

4.3. Fast local search

One of the drawbacks of the Local Search Least Squares algorithm is efficiency. Even if each neighbour of a rule set is
generated by adding/removing only one rule, it is necessary to carry out the Least Squares procedure to obtain the conse-
quents of the whole set. As this procedure involves obtaining the pseudo-inverse matrix, the algorithm could be excessively
time-consuming. Furthermore, the evaluation of the new configuration implies processing the whole dataset E.
The Fast Local Search method is based on a much more efficient approach, which exploits the intrinsic locality of FRBSs and makes learning viable even in high-dimensional problems. Given a candidate configuration $\mathcal{RB}_c$, this algorithm obtains its neighbourhood $N_F(\mathcal{RB}_c)$ by applying two possible operations to the candidate rules $R_s^c$:

• Removing a rule $R_s^c \in \mathcal{RB}_c$, maintaining the rest of the rules and their consequents.
• Assigning to $R_s^c$ the optimal value for its consequent, given the rest of the rules in $\mathcal{RB}_c$. In this case, if $R_s^c \notin \mathcal{RB}_c$, this operation implies adding the rule.

The cost of both operations is small. Whereas removal does not imply any additional calculation, the optimal consequent of a rule $R_s^c \in \mathcal{RB}_c$ can be obtained as stated in [14]:


  
 
$$b_s^c = \frac{\sum_{e_l \in E^s} \dfrac{2\, y^l\, H_c^l\, h_s^l}{\left(H_c^l\right)^2} \;-\; \sum_{e_l \in E^s} \dfrac{2\, HB_{c-s}^l\, h_s^l}{\left(H_c^l\right)^2}}{\sum_{e_l \in E^s} \dfrac{2\left(h_s^l\right)^2}{\left(H_c^l\right)^2}},$$

where $\mathcal{R}^l$ is the set of rules covering $e_l$, $H_c^l = \sum_{R_t^c \in \mathcal{R}^l} h_t^l$, and $HB_{c-s}^l = \sum_{R_t^c \in \mathcal{R}^l \,|\, R_t^c \neq R_s^c} h_t^l\, b_t$.
As for the evaluation of a neighbour, the application of the neighbourhood operator to $R_s^c$ only changes the outputs of the examples $e_l \in E^s$ with respect to the outputs produced by the original configuration $\mathcal{RB}_c$. That allows implementing a new method, $MSE_F(\mathcal{RB}_c, E)$, which does not need to reprocess the whole dataset $E$.
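A possible implementation of this closed-form update is sketched below; it assumes that the columns of match correspond to the rules of the current configuration (including R_s, whose current consequent is b[s], taken as 0 when the rule is being added), and it is not the authors' code.

import numpy as np

def optimal_consequent(match, b, y, s, eps=1e-12):
    # Closed-form consequent for rule R_s minimizing the squared error over E^s,
    # with the consequents of the remaining rules kept fixed (cf. b_s^c above).
    covered = match[:, s] > 0                      # examples in E^s
    h_s = match[covered, s]
    H = match[covered].sum(axis=1)                 # H_c^l: total firing strength
    HB_minus_s = match[covered] @ b - h_s * b[s]   # sum of h_t^l * b_t over the other rules
    num = (y[covered] * h_s / H).sum() - (HB_minus_s * h_s / H ** 2).sum()
    den = (h_s ** 2 / H ** 2).sum() + eps
    return float(num / den)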
Algorithm 4 shows the basic scheme of Fast Local Search. Although it also follows the basic algorithm for local search,

Algorithm 4: Fast Local Search algorithm.

Input: RB_i, E: initial rule base and training dataset
Output: RB_o: resulting rule base

RB_o = RB_i;
Error_o = MSE(RB_o, E);
repeat
  RB_c = RB_o;
  foreach RB'_c ∈ N_F(RB_c) do
    Error_c = MSE_F(RB'_c, E);
    if Error_c < Error_o then
      RB_o ← RB'_c;
      Error_o ← Error_c;
until Error_o does not improve;
return RB_o;

there are three main differences with respect to the Local Search Least Squares method shown in Algorithm 3. First of all, the definition of the neighbourhood changes to $N_F$; secondly, there is no need to calculate least squares for each neighbour; and last, the calculation of $MSE_F$ has been optimized.
Besides the aforementioned advantages related with efficiency, this algorithm would allow further improvements. Let $N_s(\mathcal{RB}_c)$ be the subset of (two) neighbours of $\mathcal{RB}_c$ obtained by applying the neighbourhood operators to the rule $R_s^c$, and let $\mathcal{RB}_c^{(s)} \in N_s(\mathcal{RB}_c)$ be one of such neighbours. It is not necessary to evaluate all elements in $N(\mathcal{RB}_c^{(s)})$. In fact, if $E^s \cap E^t = \emptyset$, there is no need to re-evaluate any element of $N_t(\mathcal{RB}_c^{(s)})$, since for $\mathcal{RB}_c^{(t)} \in N_t(\mathcal{RB}_c)$ and $\mathcal{RB}_c^{(t)} \in N_t(\mathcal{RB}_c^{(s)})$:

$$MSE(\mathcal{RB}_c^{(t)}, E^t) - MSE(\mathcal{RB}_c, E^t) = MSE(\mathcal{RB}_c^{(t)}, E^t) - MSE(\mathcal{RB}_c^{(s)}, E^t),$$

as the change in the rule $R_s^c$ does not affect the outputs for the examples covered by $R_t^c$.
Lastly, the algorithm may perform an extremely large number of iterations. This is due to the fact that the optimal consequent ($b_s^c$) for a rule which is already included in the rule base can be very similar to its previous consequent $b_s$. In fact, there is a point after which the improvements made are insignificant, and evaluations barely reduce the error. We have addressed this problem by not evaluating neighbours when the difference between the best consequent for a certain rule, $b_s^c$, and its current consequent $b_s$ does not reach a minimum threshold. We have fixed this threshold at 5% of the range of the output variable.

5. Experimental study

The experimental study has been divided into three stages. The first one is focused on the generation of candidate rules
and aims at evaluating our alternative criterion for screening. After determining the best method for candidate rule gener-
ation, we have used it to carry out a comparison among the rule derivation algorithms and METSK-HDe , a recent state-of-
the-art TSK-0 GFS, which tunes the data base and the rule base. Finally, we will compare the best algorithms in the previous
phase and a state-of-the-art learning algorithm called PAES-RCS [5], which has been proved to be a very accurate algorithm
to learn Mamdani FRBSs. In this stage, we will discuss the balance between performance and complexity for Mamdani and
TSK-0 FRBSs. Next, we detail the experimental settings and the analysis of the three experimental stages.

5.1. Experimental settings

For the experimental evaluation of the proposed algorithms, we used 17 regression problems from the KEEL repository [2]. These were selected among those available according to their number of features. The datasets range from 8 to 40 variables, as some of the studied algorithms cannot deal with higher-dimensional domains. Nevertheless, these sizes are still considerable in the FRBS literature [5]. Table 1 shows the properties of each dataset.

Table 1
Number of variables and instances per dataset.

           concrete  abalone  california  stock   wizmir  wankara  mv      forest  treasury
# Vars.    8         8        8           9       9       9        10      12      15
# Inst.    1030      4177     20,640      950     1461    1609     40,768  517     1049

           mortgage  baseball  house      elevators  compactiv  pole    puma32h  ailerons
# Vars.    15        16        16         18         21         26      32       40
# Inst.    1049      337       22,784     16,599     8192       14,998  8192     13,750

Fig. 2. Two alternatives for the data base generation.

Table 2
Common parameters for candidate rule generation.

Minimum support                  0.05 · N
k_t                              20
Maximum number of antecedents    3

As this study focuses on rule derivation algorithms, the fuzzy partitions are fixed prior to the learning process. In related
works, it is common to use equal width partitions with triangular fuzzy sets. However, this approach may not be adequate
when data is not equally distributed along the variable domain, and the number of instances covered by each fuzzy set is
unbalanced. This is the case in Fig. 2a. An alternative, described in [37], is the use of equal frequency partitions so that each
fuzzy set covers the same number of instances (Fig. 2b), allowing more granularity when needed.
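As a sketch of this second alternative, the following Python fragment (illustrative, not taken from the paper) builds an equal-frequency triangular partition by placing the triangle peaks at quantiles of the training values.

import numpy as np

def equal_frequency_partition(values, n_sets):
    # Place triangle peaks at quantiles so each fuzzy set covers roughly the same
    # number of training instances; neighbouring peaks act as the triangle feet.
    peaks = np.quantile(values, np.linspace(0.0, 1.0, n_sets))
    triangles = []
    for i, b in enumerate(peaks):
        a = peaks[i - 1] if i > 0 else b            # left shoulder for the first set
        c = peaks[i + 1] if i < n_sets - 1 else b   # right shoulder for the last set
        triangles.append((a, b, c))
    return triangles

# Hypothetical skewed variable: most of the mass near 0, so peaks concentrate there.
x = np.random.default_rng(0).exponential(scale=1.0, size=1000)
print(equal_frequency_partition(x, 5))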
In relation with the rule derivation methods, and after a set of preliminary experiments, we used the empty rule base as the starting point for the local search-based algorithms. The reason for that is twofold: first, the search is biased towards solutions with a small number of rules; moreover, finding the optimal set of consequents with Least Squares for systems with a large number of candidate rules can be prohibitive, in terms of time, if it has to be done repeatedly, as is the case of the Local Search Least Squares algorithm.
Last, as baselines, we have compared the results of the local search algorithms with both Least Squares and a recent implementation of a genetic algorithm for building FRBSs in regression problems, called METSK-HDe, described in [22]. In the latter, the fuzzy partitions are evolved together with the rule base definition. Therefore, the one-step fuzzy partition derivation is only used by the other algorithms, and the contrast between both approaches (equal-width and equal-frequency partitioning) is done by comparing the results of those search algorithms.
In order to carry out the comparisons, we report the average MSE of a 5-fold cross-validation, using the same folds as
in [22] (obtained from KEEL). In the case of non-deterministic algorithms (only METSK-HDe ) we report the average of 6
independent executions (each one a 5-fold cross-validation) using different seeds.

5.2. Experiments on the generation of candidate rules

The first goal of this work consists of determining whether the individual error of a rule – expressed in the second term of Eq. (4) through the standard deviation of the outputs for the matched examples – is a valid criterion for candidate selection. In order to do that, we have tested both alternatives (namely SD and None) for candidate rule generation. Moreover, as the definition of the fuzzy partitions may influence the results, we have obtained results for both equal-width (Width) and equal-frequency (Freq) partitioning.
We analyzed the configurations for each one of the three rule derivation algorithms: Least Squares (LS), Local Search
Least Squares (LSLS) and Fast Local Search (FLS). In all cases, and based on previous studies, we fixed the parameters of
Apriori+screening shown in Table 2.

Table 3
Wins/ties/losses between the four parametrizations of the different methods.

(a) Least Squares
Algorithm    Width-SD   Freq-None   Width-None
Freq-SD      13/0/4     13/0/4      12/1/4
Width-SD     -          9/1/7       10/0/7
Freq-None    -          -           9/0/8

(b) Local Search Least Squares
Algorithm    Freq-None  Width-None  Width-SD
Freq-SD      15/0/2     13/0/4      15/0/2
Freq-None    -          9/0/8       9/1/7
Width-None   -          -           8/0/9

(c) Fast Local Search
Algorithm    Width-SD   Freq-None   Width-None
Freq-SD      12/1/4     13/1/3      15/0/2
Width-SD     -          11/1/5      14/0/3
Freq-None    -          -           12/0/5

Table 4
Training and test MSE for the Least Squares method using different parametrizations for the candidate rule generation.
Results should be multiplied by 10+3 , 10+1 , 10+9 , 10+8 , 10−6 , 10−4 or 10−8 in the case of baseball, forestFires, california,
house, elevators, puma32h or ailerons respectively.

problem Freq-None-LS Width-None-LS Freq-SD-LS Width-SD-LS

train test train test train test train test

abalone 2.3071 2.4389 2.1352 2.2585 2.3633 2.5238 2.3304 2.5617


ailerons 5.3732 5.4079 7.2590 7.2377 4.6531 4.6874 7.9358 7.9867
baseball 261.7531 364.8455 222.9756 426.6626 198.1051 360.4661 191.3739 352.8633
california 2.3121 2.3561 2.0747 2.0830 2.4234 2.4535 2.1121 2.1329
compactiv 80.3330 81.7453 27.6436 28.0367 53.6934 56.1035 23.0224 24.1361
concrete 14.8491 18.4142 22.5819 27.4573 13.8325 18.4295 23.1070 27.6895
elevators 20.0000 20.0000 12.0000 14.0000 14.0000 14.0000 20.0000 20.0000
forestFires 184.0076 215.3659 189.6632 202.7618 180.5597 218.4974 187.3557 209.4896
house 7.3664 7.4227 10.1019 10.1916 7.0352 7.1000 9.7561 9.8542
mortgage 0.0502 0.0706 0.1324 0.1819 0.0203 0.0356 0.0443 0.0848
mv 2.6336 2.6649 2.5495 2.5656 0.4126 0.4110 0.7178 0.7144
pole 799.6310 805.5906 606.7114 607.6415 554.0515 559.7320 693.6948 694.6981
puma32h 4.4200 4.4800 4.3400 4.4400 2.6400 2.7200 3.1800 3.2400
stock 0.8931 1.0849 0.9165 1.1105 0.6054 0.7685 0.5907 0.8279
treasury 0.0477 0.0684 0.2542 0.4363 0.0343 0.0440 0.0791 0.1540
wankara 1.4714 1.6901 9.2727 11.3442 0.9944 1.2175 1.1711 2.9680
wizmir 1.7685 2.0930 7.3305 10.3342 0.7881 1.0420 1.3722 1.9651

Tables 4–6 show the training and test MSE for each problem and search method. As can be observed, introducing the standard deviation of the rule in the criterion used in the screening process (SD) clearly leads to better results, in most cases also when using equal-frequency (Freq) partitioning. Table 3 shows a pairwise comparison between the schemes, in which the win/tie/loss metric is reported for each method. As can be seen, the Freq-SD configuration outperforms the rest in all cases. This is presented more clearly in Fig. 3, which shows the box plots for the ranks obtained with each one of the rule derivation algorithms.

5.3. Experiments on the methods for rule derivation

In this subsection, we compare METSK-HDe, LS and the proposed local search algorithms. As mentioned, these algorithms (except METSK-HDe) must both select a set of candidate rules and find the values for the consequents. For the experiments, and based on the results of the previous subsection, we used the best parametrization for candidate rule selection: frequency-based fuzzy partitioning and the scheme for candidate rule screening that considers the standard deviation.

Table 8 shows the average MSE for training and test, and Fig. 4 shows the ranking distribution for the test MSE. As can be seen, METSK-HDe is the method which obtains the smallest error in most cases, and therefore obtains the lowest mean rank. In contrast, LS never obtains the best result. In order to compare the different approaches, we carried out the statistical study suggested in [17], performed with the software tool exReport [7], using α = 0.05 as the confidence level. First of all, we applied a Friedman test [20] to check whether there are statistical differences among the proposed methods. The output shows that differences exist (p-value of 5.13 · 10−6 ≤ 0.05). Then, we applied a post-hoc test based on the mean ranks, comparing the control method (the best in mean rank) against the rest, and corrected the

Table 5
Training and test MSE for the Local Search Least Squares method using different parametrizations for the candidate
rule generation. Results should be multiplied by 10+3 , 10+1 , 10+9 , 10+8 , 10−6 , 10−4 or 10−8 in the case of baseball,
forestFires, california, house, elevators, puma32h or ailerons respectively.

problem Freq-None-LSLS Width-None-LSLS Freq-SD-LSLS Width-SD-LSLS

tra.err tes.err tra.err tes.err tra.err tes.err tra.err tes.err

abalone 2.2467 2.3460 2.1304 2.2550 2.3376 2.4757 2.3213 2.5382


ailerons 5.2422 5.2678 7.2196 7.1889 4.5049 4.5806 7.9028 7.9165
baseball 226.4047 319.3569 198.9980 319.4074 175.4569 311.9695 176.2100 326.4885
california 2.2202 2.2709 2.0226 2.0196 2.3656 2.3854 2.0098 2.0158
compactiv 77.6988 77.2027 26.0842 26.2192 59.2590 61.9661 29.8612 31.5240
concrete 13.7460 18.4038 21.4605 26.3715 12.9074 17.0438 22.1377 26.8794
elevators 20.0000 20.0000 12.0000 12.0000 10.0000 10.0000 20.0000 20.0000
forestFires 116.6246 267.5523 168.0847 221.2927 100.8088 231.7140 161.9533 295.4768
house 7.1659 7.2170 10.8459 10.8904 6.7443 6.8472 9.6114 9.6741
mortgage 0.0295 0.0342 0.1069 0.1217 0.0212 0.0256 0.0518 0.0574
mv 0.7079 0.7111 1.6625 1.6814 0.2750 0.2734 1.2164 1.2313
pole 799.3666 804.6649 606.6906 607.6719 535.5939 541.8291 692.0380 692.9817
puma32h 4.4200 4.4800 4.2200 4.2800 1.5000 1.5200 1.9200 1.8600
stock 0.8898 1.1004 0.8014 0.9661 0.5661 0.7248 0.5668 0.7561
treasury 0.0494 0.0581 0.2827 0.3406 0.0518 0.0537 0.0512 0.0670
wankara 1.9883 2.1496 6.3552 6.6761 0.8882 1.0246 0.9448 1.0440
wizmir 2.4570 2.5731 5.2252 5.2748 0.8159 0.9550 1.3259 1.5090

Table 6
Training and test MSE for the Fast Local Search method using different parametrizations for the candidate rule gener-
ation. Results should be multiplied by 10+3 , 10+1 , 10+9 , 10+8 , 10−6 , 10−4 or 10−8 in the case of baseball, forestFires,
california, house, elevators, puma32h or ailerons respectively.

problem Freq-None-FLS Width-None-FLS Freq-SD-FLS Width-SD-FLS

tra.err tes.err tra.err tes.err tra.err tes.err tra.err tes.err

abalone 3.0553 3.0842 3.2844 3.3026 3.0663 3.0830 2.9792 3.0094


ailerons 5.5580 5.5626 7.4599 7.3900 5.5815 5.5670 8.1273 8.1212
baseball 290.6959 313.1281 290.7852 326.1132 233.6245 302.9229 231.5379 305.9670
california 4.0316 4.0264 4.2634 4.2929 3.8866 3.9034 3.5782 3.5800
compactiv 113.9184 114.2434 142.0446 142.1951 93.5328 94.0234 80.1820 80.0981
concrete 48.7212 51.9840 62.5363 64.0971 41.3264 44.0964 59.8523 64.8027
elevators 20.0000 20.0000 12.0000 14.0000 20.0000 20.0000 20.0000 20.0000
forestFires 165.7130 226.8535 188.1627 199.6680 168.4253 209.4564 190.7464 198.4508
house 9.3753 9.3265 13.0519 13.0967 9.7023 9.7707 12.7656 12.8104
mortgage 0.0622 0.0691 0.3916 0.4027 0.0567 0.0670 0.1971 0.1997
mv 4.1222 4.1008 3.8342 3.8413 2.0222 2.0254 2.1325 2.1483
pole 842.7329 843.1353 836.2952 837.4054 652.2413 656.7558 808.7412 809.3857
puma32h 4.4600 4.4800 4.2400 4.2600 1.9600 1.9600 2.4200 2.4200
stock 4.2211 4.5373 5.7248 5.9723 1.8670 2.0974 2.7106 2.9699
treasury 0.0817 0.0906 0.5457 0.5855 0.1051 0.1140 0.1261 0.1336
wankara 5.5687 5.8217 10.3902 9.8313 2.0471 2.0351 2.0009 2.1742
wizmir 3.6431 3.8348 7.3726 7.2930 1.6575 1.8513 2.4325 2.7746

Fig. 3. Ranking distribution of the four candidate rule generation schemes for LSLS, FLS and LS.

Fig. 4. Ranking distribution for METSK-HDe and the methods LS, LSLS and FLS using the Freq-SD parametrization.

Table 7
Mean rank, win/tie/loss metric and p-values for the statistical
comparison in terms of test MSE.

algorithm rank p-value win tie loss

METSK-HDe 1.47 − − − −
Freq-SD-LSLS 2.06 1.8404e-01 13 0 4
Freq-SD-LS 2.82 4.4956e-03 14 0 3
Freq-SD-FLS 3.65 2.6613e-06 16 0 1

Table 8
Training and test MSE for METSK-HDe and the methods LS, LSLS and FLS using the Freq-SD parametrization. Results
should be multiplied by 10+3 , 10+1 , 10+9 , 10+8 , 10−6 , 10−4 or 10−8 in the case of baseball, forestFires, california,
house, elevators, puma32h or ailerons respectively.

problem Freq-SD-LS Freq-SD-LSLS Freq-SD-FLS METSK-HDe

tra.err tes.err tra.err tes.err tra.err tes.err tra.err tes.err

abalone 2.3633 2.5238 2.3376 2.4757 3.0663 3.0830 2.2050 2.3920


ailerons 4.6531 4.6874 4.5049 4.5806 5.5815 5.5670 1.3900 1.5100
baseball 198.1051 360.4661 175.4569 311.9695 233.6245 302.9229 0.0479 0.3688
california 2.4234 2.4535 2.3656 2.3854 3.8866 3.9034 1.6400 1.7100
compactiv 53.6934 56.1035 59.2590 61.9661 93.5328 94.0234 4.3760 4.9490
concrete 13.8325 18.4295 12.9074 17.0438 41.3264 44.0964 15.0540 23.8850
elevators 14.0000 14.0000 10.0000 10.0000 20.0000 20.0000 6.7500 7.0200
forestFires 180.5597 218.4974 100.8088 231.7140 168.4253 209.4564 55.1000 558.7000
house 7.0352 7.1000 6.7443 6.8472 9.7023 9.7707 8.2900 8.6400
mortgage 0.0203 0.0356 0.0212 0.0256 0.0567 0.0670 0.0050 0.0130
mv 0.4126 0.4110 0.2750 0.2734 2.0222 2.0254 0.0600 0.0610
pole 554.0515 559.7320 535.5939 541.8291 652.2413 656.7558 57.9640 61.0180
puma32h 2.6400 2.7200 1.5000 1.5200 1.9600 1.9600 0.2669 0.2871
stock 0.6054 0.7685 0.5661 0.7248 1.8670 2.0974 0.1670 0.3870
treasury 0.0343 0.0440 0.0518 0.0537 0.1051 0.1140 0.0170 0.0380
wankara 0.9944 1.2175 0.8882 1.0246 2.0471 2.0351 0.7010 1.1890
wizmir 0.7881 1.0420 0.8159 0.9550 1.6575 1.8513 0.7290 0.9440

family-wise error by the procedure of Holm [25]. The results, shown in Table 7, support that METSK-HDe outperforms both
Freq-SD-LS and Freq-SD-FLS in terms of accuracy, but there is no statistical difference between METSK-HDe and Freq-SD-
LSLS.
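For reference, the sketch below reproduces the general shape of this protocol in Python with illustrative random data; note that it substitutes pairwise Wilcoxon signed-rank tests for the mean-ranks post-hoc test used by exReport, so it is an approximation of the procedure rather than the exact one.

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# errors[i, j]: test MSE of algorithm j on dataset i (illustrative values only).
rng = np.random.default_rng(1)
errors = rng.random((17, 4))
names = ["METSK-HD", "LSLS", "LS", "FLS"]

# Friedman test over the per-dataset results of the four algorithms.
stat, p = friedmanchisquare(*errors.T)
print(f"Friedman p-value: {p:.4g}")

# Post hoc: control method = best mean rank; pairwise tests against the control
# are corrected for the family-wise error with Holm's procedure.
ranks = np.argsort(np.argsort(errors, axis=1), axis=1).mean(axis=0)
control = int(np.argmin(ranks))
pvals = [wilcoxon(errors[:, control], errors[:, j]).pvalue
         for j in range(errors.shape[1]) if j != control]
reject, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")
others = [n for i, n in enumerate(names) if i != control]
print(names[control], "vs rest:", dict(zip(others, p_holm.round(4))))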
After that, we carried out a comparison between the two best-performing algorithms,3 METSK-HDe and Freq-SD-LSLS, in terms of
simplicity of the systems. As in [5], we measured the complexity of the models as the sum of the number of antecedents

3
To confirm the equivalence between METSK-HDe and Freq-SD-LSLS in terms of MSE, we have also used a Wilcoxon Signed Rank test between them, concluding that there is no statistical difference (p-value of 0.066 > 0.05).

Table 9
Comparison between Freq-SD-LSLS and METSK-HDe, in terms of test error, number of rules, number of predicates in
the antecedent and computational time. Test errors should be multiplied by 10+3 , 10+1 , 10+9 , 10+8 , 10−6 , 10−4 or
10−8 in the case of baseball, forestFires, california, house, elevators, puma32h or ailerons respectively.

problem Freq-SD-LSLS METSK-HDe

tes.err rules ant comp time tes.err rules ant comp time

abalone 2.4757 57.0 1.8 105.2 2995.00 2.3920 23.1 4.2 97.0 1735.00
ailerons 4.5806 17.8 1.8 31.9 12213.70 1.5100 48.4 6.0 290.4 19590.00
baseball 311.9695 26.8 1.8 48.6 70.41 0.3688 59.8 7.0 418.6 3118.00
california 2.3854 65.2 1.7 113.1 30488.07 1.7100 55.8 4.9 273.4 18808.00
compactiv 61.9661 13.8 1.4 19.8 1026.41 4.9490 32.9 6.1 200.7 13069.00
concrete 17.0438 63.6 1.7 108.0 906.15 23.8850 53.7 4.2 225.5 2102.00
elevators 10.0000 17.2 1.8 30.1 1944.31 7.0200 34.9 5.5 191.9 11218.00
forestFires 231.7140 46.4 1.6 72.5 213.69 558.7000 40.6 5.2 211.1 1675.00
house 6.8472 53.0 1.4 75.6 19184.92 8.6400 30.5 5.0 152.5 18478.00
mortgage 0.0256 15.0 1.6 24.4 101.06 0.0130 27.2 4.3 117.0 475.00
mv 0.2734 18.0 1.8 32.8 7836.71 0.0610 56.5 4.0 226.0 11874.00
pole 541.8291 16.2 1.6 26.4 1467.44 61.0180 46.3 6.3 291.7 16822.00
puma32h 1.5200 5.4 1.3 7.2 513.45 0.2871 63.3 4.0 253.2 8545.00
stock 0.7248 56.0 2.0 112.2 572.10 0.3870 66.4 5.3 351.9 2625.00
treasury 0.0537 14.2 1.7 24.5 115.01 0.0380 28.1 4.6 129.3 659.00
wankara 1.0246 29.6 1.8 52.0 354.02 1.1890 48.0 4.7 225.6 2832.00
wizmir 0.9550 19.8 1.7 33.3 170.88 0.9440 29.1 4.0 116.4 1173.00

Table 10
Comparison between Freq-SD-LSLS, METSK-HDe and PAES-RCS.

problem Freq-SD-LSLS METSK-HDe PAES-RCS

tes.err comp time tes.err comp time tes.err comp time

abalone 2.4757 105.2 2995.0 2.3920 97.0 1735.0 2.5800 33.4 389.0
ailerons 4.5806 31.9 12213.7 1.5100 290.4 19590.0 1.7300 122.8 4798.3
california 2.3854 113.1 30488.1 1.7100 273.4 18808.0 2.6100 43.2 2127.6
compactiv 61.9661 19.8 1026.4 4.9490 200.7 13069.0 5.4800 82.4 1638.4
elevators 10.0000 30.1 1944.3 7.0200 191.9 11218.0 7.3000 76.4 3073.2
house 6.8472 75.6 19184.9 8.6400 152.5 18478.0 9.0500 55.0 3075.0
mv 0.2734 32.8 7836.7 0.0610 226.0 11874.0 1.6700 29.7 3721.6

for each rule in the model (comp column in Table 9). We used a Wilcoxon Signed Rank test [34] between METSK-HDe and Freq-SD-LSLS on the complexity metric. As can be observed in Table 9, Freq-SD-LSLS reaches the best results in terms of the simplicity of the antecedents for all the problems but abalone (for which it obtains a similar complexity). The p-value obtained is 1.53 · 10−5. Therefore, the models obtained by Freq-SD-LSLS are statistically simpler than those obtained with METSK-HDe.
Table 9 also shows the execution times for both algorithms, Freq-SD-LSLS and METSK-HDe. Such times are not directly comparable, as the executions were performed on machines with different hardware specifications.4 However, it can be observed that learning times are, at least, in the same order of magnitude, or even shorter in the case of Freq-SD-LSLS.

5.4. Comparing freq-SD-LSLS and PAES-RCS

In the previous experimentation we compared a GFS (METSK-HDe) and different approaches based on local search algorithms for learning TSK-0 FRBSs. We observed that METSK-HDe obtains better prediction errors for most of the problems, but Freq-SD-LSLS is not statistically different from it, and the latter learns notably simpler models. In this stage we compare both algorithms with PAES-RCS, a GFS for learning Mamdani FRBSs which has been proved to build very accurate models [5]. In general, TSK rules introduce more expressiveness in the models, but reduce significantly their interpretability. We will focus on the balance of precision and interpretability in the three selected algorithms.
Table 10 shows the test error, the complexity metric and the computational time used to learn the models. In this experimentation stage, we have used a subset of problems (those in common in [4,5]). As we can see, METSK-HDe is the best method in terms of prediction error: the Friedman test p-value is 0.02, and the p-values of the mean rank test (using METSK-HDe as control method) are 3.25 · 10−2 and 1.51 · 10−2 for Freq-SD-LSLS and PAES-RCS respectively.

On the other hand, if we compare the algorithms in terms of complexity, we can observe that PAES-RCS and Freq-SD-LSLS are the best methods, with no statistical difference between them: the Friedman test p-value is 0.02, and the mean rank test p-values are 5.93 · 10−1 and 1.51 · 10−2 respectively for Freq-SD-LSLS and METSK-HDe.

4
Intel(R) Xeon(R) E5450 3.00 GHz and 4 GB RAM in the case of Freq-SD-LSLS; Intel Core 2 Quad Q9550 2.83 GHz and 8 GB RAM in the case of METSK-HDe.

Table 11
Execution times (in seconds) for Freq-SD-LSLS and Freq-SD-FLS.

algorithm      abalone  ailerons  baseball  california  compactiv  concrete  elevators  forestFires
Freq-SD-LSLS   2995.0   12213.7   70.4      30488.1     1026.4     906.1     1944.3     213.7
Freq-SD-FLS    85.1     8647.5    22.5      478.6       608.6      19.6      1676.4     47.0

algorithm      house    mortgage  mv        pole        puma32h    stock     treasury   wankara  wizmir
Freq-SD-LSLS   19184.9  101.1     7836.7    1467.4      513.5      572.1     115.0      354.0    170.9
Freq-SD-FLS    1408.8   59.7      1119.8    714.6       317.1      32.5      53.1       52.4     42.7

Finally, even if computational times are not comparable because the experimentation was carried out on different computers (with different hardware specifications), PAES-RCS seems to be the fastest algorithm, followed by Freq-SD-LSLS and METSK-HDe (the last one being noticeably slower than the rest).

5.5. Discussion

The first objective of this article was to test whether the use of the individual error of a rule in the screening process leads to a better set of candidate rules. The results reported in Section 5.2 confirm this hypothesis, as they show that, regardless of the subsequent derivation algorithm, and independently of the way the fuzzy partitions of the variables are generated, the use of this information improves the performance of the resulting models.
Once the best scheme for candidate rule selection had been determined, we aimed at testing the algorithms that search for the best set of candidate rules. The first comparison considered the accuracy of the systems. The statistical study of the results determined that both the METSK-HDe algorithm and the first of our proposals, the Freq-SD-LSLS algorithm, clearly outperform the Freq-SD-LS algorithm and the optimized version of the local search, Freq-SD-FLS. Although there are no statistical differences between METSK-HDe and Freq-SD-LSLS in terms of accuracy, the win/tie/loss metric (Table 7) indicates that METSK-HDe obtains better prediction errors for most of the problems. In this sense, it is interesting to remark that Freq-SD-LSLS learns the rule base using a set of fixed fuzzy partitions, while METSK-HDe evolves both the rule base and the data base (fuzzy partitions). Furthermore, the models learnt by the local search methods are much simpler, containing about half of the rules of those learnt by METSK-HDe. Thus, the results obtained by Freq-SD-LSLS are quite remarkable in comparison with the more complete approach used by METSK-HDe. The impact of the fuzzy partitions used by the algorithm is so clear that even a naive strategy like the one based on equal-frequency discretization with a fixed number of fuzzy sets clearly outperforms standard equal-width discretization (Section 5.2). In our opinion, this fact shows that there is room to improve our local search-based algorithms by incorporating the tuning of the data base into them, or by using a more sophisticated algorithm to generate a fixed data base [37]. However, this is not an easy task, and will be the goal of future research.
Following the previous comparison, we evaluated the simplicity of the models obtained by METSK-HDe and Freq-SD-LSLS. As expected, due to both the size limitation included in Apriori (three antecedents) and the starting point of the local search algorithm (empty rule base), Freq-SD-LSLS obtains simpler models than METSK-HDe, reducing by a very significant factor the complexity, measured as the sum of the antecedents of all the rules in the model. It can be observed that this reduction is very important in the datasets mentioned above, where METSK-HDe clearly improves on Freq-SD-LSLS. However, additional experiments show that increasing the maximum size in the Apriori stage does not lead to better results, while generating a large amount of candidate rules. Furthermore, using other starting points for Freq-SD-LSLS does not improve the results either, but makes the algorithm slower. These two facts reinforce the idea that the key point in improving the algorithm consists of learning, at some stage, the fuzzy partitions. Finally, because of the different hardware specifications of the computers where the executions were performed, computational times are not comparable. However, it seems that the times are in the same order of magnitude.
In addition, we have compared the two best-performing algorithms, METSK-HDe and Freq-SD-LSLS, which use TSK-0 fuzzy rules, with PAES-RCS, a GFS which uses Mamdani fuzzy rules. We have used a subset of problems which are present in both [22] and [5]. We have verified that METSK-HDe is the best algorithm in terms of accuracy, but at the expense of interpretability, where PAES-RCS and Freq-SD-LSLS obtain the best results. Regarding Freq-SD-LSLS, it offers a balance between accuracy and interpretability: it is statistically worse than METSK-HDe in terms of prediction error, but better than PAES-RCS in mean rank. On the other hand, it is statistically equivalent to PAES-RCS in terms of complexity. Regarding interpretability, PAES-RCS uses Mamdani fuzzy rules, which increases readability compared with TSK-0 rules. On the other hand, Freq-SD-LSLS does not tune the data base, which simplifies the interpretation of the fuzzy partitions.
Another important point concerns the ability of the algorithms to deal with higher-dimensional problems. As stated in Section 5.1, we limited the number of variables to 40 because LS gets stuck when trying to estimate the consequents for so many rules, and METSK-HDe has not been tested in [22] on larger datasets (in terms of number of features). In this respect, it is worth pointing out that LSLS can work with larger problems, as it starts the search from an empty set of rules, and most configurations include a small number of rules.
In relation with the FLS algorithm, the accuracy obtained by this method did not reach the results of METSK-HDe and LSLS. However, this algorithm was designed to be more efficient over high-dimensional datasets. In this respect, the execution times were not as low as expected in some cases, as can be observed in Table 11. In part, this is due to the cost of candidate rule generation (which does not change). Moreover, even though the estimation of the consequent for a rule only processes the examples covered by that rule, many rules tend to be general and cover a great proportion of the examples (because of their high support). Additional experiments show that this algorithm does not reach LSLS even when considering alternative search algorithms such as Iterated Local Search.

6. Conclusions and future work

In this work, we have studied a method for rule learning in FRBSs which is based on the use of the Apriori algorithm for the generation of candidate rules, and two local search algorithms for the derivation of the final rule base.
In the first stage, we have successfully introduced an estimation of the individual error as a criterion for filtering, through screening, the rules generated by Apriori. It outperforms the original method regardless of the subsequent search algorithm and the way the partitions were generated.
Afterwards, we have compared the local search algorithms with a state-of-the-art TSK-0 GFS which evolves simultaneously the data base and the rule base (METSK-HDe). The first one, LSLS, recalculates the consequents of all the rules for every evaluated system, whereas the second, FLS, changes only one rule at each step. Results show that LSLS does not present significant differences with METSK-HDe in terms of accuracy, but it produces simpler models, with fewer and shorter rules.
On the other hand, the FLS algorithm was designed for efficiency. Although the results obtained were poorer in terms of accuracy, the reduction in learning time compared with LSLS or METSK-HDe is remarkable.
In addition to METSK-HDe, we have also compared a state-of-the-art GFS which uses Mamdani rules (PAES-RCS) and Freq-SD-LSLS, the only algorithm which does not tune the data base. We have observed that while METSK-HDe obtains the most accurate and least interpretable models, PAES-RCS is the opposite (the most interpretable and least accurate models in terms of mean rank). Freq-SD-LSLS is the balanced option: among the three algorithms, it is the second best in terms of both accuracy and simplicity of the models.
As future work, we also plan to use other alternatives for candidate rule generation, designed mainly to deal with very large datasets, and to evaluate the performance and scalability of the different approaches. One of the possibilities consists of directly discarding rules with very high support if they also present a high standard deviation. This could have a positive impact on the FLS algorithm, which is still the only algorithm able to deal with much larger datasets (in terms of number of features). Because of that, it could be of interest to study and improve its behavior and to compare it with other algorithms.
Another line of work is related to the evaluation of the systems. In this work we have used the Least Squares technique to fix the consequents of the rules. However, the matrix C might be ill-conditioned because of the presence of many redundant rules, sometimes affecting the performance (overfitting) and the complexity of the systems (which include the whole set of rules). In that sense, we plan to address this problem using techniques such as ridge and lasso regression, or the minimum norm solution.
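The sketch below illustrates the intended alternatives, assuming the usual formulation in which the normalized firing-degree matrix plays the role of the design matrix of the Least Squares problem; the function name `fit_tsk0_consequents`, the normalization and the regularization value are assumptions for illustration, not the exact procedure of the proposed algorithms. A small ridge penalty, or the minimum-norm solution obtained via the pseudo-inverse, keeps the fit stable when redundant rules make the matrix nearly singular.

```python
import numpy as np

def fit_tsk0_consequents(firing, y, alpha=0.0):
    """Fit all TSK-0 consequents at once by (regularized) least squares.

    firing: normalized firing degrees, shape (n_examples, n_rules);
            column r holds w_r(x_i) / sum_s w_s(x_i).
    y:      training outputs, shape (n_examples,).
    alpha:  ridge penalty; alpha = 0 falls back to the minimum-norm solution
            (pseudo-inverse), which tolerates an ill-conditioned firing matrix.
    """
    if alpha > 0.0:
        # Ridge regression: solve (F^T F + alpha I) b = F^T y
        A = firing.T @ firing + alpha * np.eye(firing.shape[1])
        return np.linalg.solve(A, firing.T @ y)
    # Minimum-norm least-squares solution via the pseudo-inverse
    return np.linalg.pinv(firing) @ y

# Redundant (nearly duplicated) rules make firing.T @ firing close to singular;
# the pseudo-inverse or a small ridge penalty keeps the estimation stable.
rng = np.random.default_rng(1)
base = rng.uniform(size=(200, 5))
firing = np.hstack([base, base[:, :1] + 1e-9])   # last rule duplicates the first one
firing /= firing.sum(axis=1, keepdims=True)      # normalize firing degrees per example
y = rng.normal(size=200)
print(fit_tsk0_consequents(firing, y, alpha=1e-3))   # ridge-regularized consequents
print(fit_tsk0_consequents(firing, y))               # minimum-norm consequents
```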
Furthermore, we plan to extend the study of local search algorithms to first-order TSK systems, where the consequents of the rules are linear functions of the input variables. In this case, the adaptation is not straightforward, as it is necessary to adapt the way consequents are calculated and, moreover, the risk of overfitting is much higher.
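As a rough illustration of why the adaptation is not straightforward, the following sketch builds the design matrix of the corresponding Least Squares problem for first-order consequents, under the common formulation where each rule contributes a firing-weighted copy of [1, x1, ..., xp]; the function name and shapes are illustrative assumptions. The number of parameters grows from one per rule to (p + 1) per rule, which explains the higher risk of overfitting.

```python
import numpy as np

def first_order_design_matrix(firing, X):
    """Build the least-squares design matrix for first-order TSK consequents.

    firing: normalized firing degrees, shape (n_examples, n_rules).
    X:      input matrix, shape (n_examples, n_features).
    Each rule r contributes 1 + n_features columns: w_r(x) * [1, x_1, ..., x_p].
    """
    n, p = X.shape
    extended = np.hstack([np.ones((n, 1)), X])                       # [1, x] per example
    blocks = [firing[:, [r]] * extended for r in range(firing.shape[1])]
    return np.hstack(blocks)                                          # (n, n_rules * (p + 1))

rng = np.random.default_rng(2)
X = rng.uniform(size=(100, 3))
firing = rng.uniform(size=(100, 4))
firing /= firing.sum(axis=1, keepdims=True)
D = first_order_design_matrix(firing, X)
print(D.shape)   # (100, 16): 4 rules x (3 features + intercept)
```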
Finally, this work is mainly devoted to rule derivation given the fuzzy partitions of the variables, which have been obtained in a straightforward way. However, algorithms which also learn such partitions, like METSK-HDe, usually achieve great improvements in the results. In this respect, it is essential to study the effect of introducing a partitioning algorithm as a previous step of the proposed local search algorithms.

Acknowledgments

This study has been partially funded by the Spanish Government (MINECO) and FEDER funds through project TIN2013-46638-C3-3-P. Javier Cózar has also been funded by MICINN through the grant FPU12/05102.

References

[1] R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases, SIGMOD Rec. 22 (2) (1993) 207–216.
[2] J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL Data-mining software tool: data set repository, integration of algo-
rithms and experimental analysis framework, J. Multiple-Valued Logic Soft Comput. 17 (2–3) (2010) 255–287.
[3] R. Alcalá, M.J. Gacto, F. Herrera, A fast and scalable multiobjective genetic fuzzy system for linguistic fuzzy modeling in high-dimensional regression
problems, IEEE Trans. Fuzzy Syst. 19 (4) (2011) 666–681.
[4] J. Alcala-Fdez, R. Alcala, F. Herrera, A fuzzy association rule-based classification model for high-dimensional problems with genetic rule selection and
lateral tuning, Fuzzy Syst. IEEE Trans. 19 (5) (2011) 857–872.
[5] M. Antonelli, P. Ducange, F. Marcelloni, An efficient multi-objective evolutionary fuzzy system for regression problems, Int. J. Approx. Reason. 54 (9)
(2013) 1434–1451.
[6] M. Antonelli, P. Ducange, F. Marcelloni, A. Segatori, On the influence of feature selection in fuzzy rule-based regression model generation, Inf. Sci. 329
(2016) 649–669.
[7] J. Arias, J. Cózar, ExReport: fast, reliable and elegant reproducible research, 2015, http://exreport.jarias.es/.
[8] J. Casillas, O. Cordón, F. Herrera, Cor: a methodology to improve ad hoc data-driven linguistic rule learning methods by inducing cooperation among
rules, IEEE Trans. Syst. Man Cybern. Part B 32 (4) (2002) 526–537.
[9] O. Cordón, E. Herrera, E. Gomide, E. Hoffman, L. Magdalena, Ten years of genetic fuzzy systems: current framework and new trends, 2001, 3, 1241–
1246.
[10] O. Cordón, F. Herrera, A proposal for improving the accuracy of linguistic modeling, IEEE Trans. Fuzzy Syst. 8 (3) (2000) 335–344.
[11] O. Cordón, F. Herrera, F. Hoffmann, L. Magdalena, Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases, World Scientific,
2001.
[12] O. Cordón, F. Herrera, A. Peregrín, Applicability of the fuzzy operators in the design of fuzzy logic controllers, Fuzzy Sets Syst. 86 (1) (1997) 15–41.
[13] J. Cózar, L. delaOssa, J.A. Gámez, Learning cooperative TSK-0 fuzzy rules using fast local search algorithms, 2011, 353–362.
[14] J. Cózar, L. delaOssa, J.A. Gámez, Learning TSK-0 linguistic fuzzy rules by means of local search algorithms, Appl. Soft Comput. 21 (0) (2014) 57–71.
[15] J. Cózar, L. delaOssa, J.A. Gámez, TSK-0 fuzzy rule-based systems for high-dimensional problems using the apriori principle for rule generation, Rough
Sets and Curr. Trends Comput. 8536 (2014) 270–279.
[16] L. delaOssa, J. Gámez, J. Puerta, Learning cooperative fuzzy rules using fast local search algorithms, 2006, 2134–2141.
[17] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[18] M. Fazzolari, R. Alcala, Y. Nojima, H. Ishibuchi, F. Herrera, A review of the application of multiobjective evolutionary fuzzy systems: current status and
further directions, IEEE Trans. Fuzzy Syst. 21 (1) (2013) 45–65.
[19] A. Fernandez, V. Lopez, M. del Jesus, F. Herrera, Revisiting evolutionary fuzzy systems: taxonomy, applications, new trends and challenges, Knowl.
Based Syst. 80 (2015) 109–121.
[20] M. Friedman, A comparison of alternative tests of significance for the problem of m rankings, Ann. Math. Stat. (1940) 86–92.
[21] M.J. Gacto, R. Alcalá, F. Herrera, Interpretability of linguistic fuzzy rule-based systems: an overview of interpretability measures, Inf. Sci. 181 (20)
(2011) 4340–4360.
[22] M.J. Gacto, M. Galende, R. Alcalá, F. Herrera, METSK-HDe: a multiobjective evolutionary algorithm to learn accurate TSK-fuzzy systems in high-dimensional and large-scale regression problems, Inf. Sci. 276 (2014) 63–79.
[23] F. Glover, G. A. Kochenberger (eds.), Handbook of Metaheuristics, International Series in Operations Research & Management Science, Kluwer Academic
Publishers, 2003.
[24] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed., Addison-Wesley Longman Publishing Co., Inc., 1989.
[25] S. Holm, A simple sequentially rejective multiple test procedure, Scand. J. Stat. (1979) 65–70.
[26] V.C. Klema, A.J. Laub, The singular value decomposition: its computation and some applications, Autom. Control, IEEE Trans. 25 (2) (1980) 164–176.
[27] E. Mamdani, Applications of fuzzy algorithm for control a simple dynamic plant, 1974, 1585–1588.
[28] E. Mamdani, S. Assilian, An experiment in linguistic synthesis with a fuzzy logic controller, Int. J. Man Mach. Stud. 7 (1975) 1–13.
[29] K. Nozaki, H. Ishibuchi, H. Tanaka, A simple but powerful heuristic method for generating fuzzy rules from numerical data, Fuzzy Sets Syst. 86 (3)
(1997) 251–270.
[30] S.E. Papadakis, J. Theocharis, A GA-based fuzzy modeling approach for generating TSK models, Fuzzy Sets Syst. 131 (2) (2002) 121–152.
[31] T. Takagi, M. Sugeno, Fuzzy identification of systems and its applications for modeling and control, IEEE Trans. Syst. Man Cybern. 15 (1) (1985) 116–132.
[32] C. Wang, T. Hong, S. Tseng, Integrating fuzzy knowledge by genetic algorithms, IEEE Trans. Evol. Comput. 2 (4) (1998) 138–149.
[33] L. Wang, J. Mendel, Generating fuzzy rules by learning from examples, IEEE Trans. Syst. Man Cybern. 22 (6) (1992) 1414–1427.
[34] F. Wilcoxon, Individual comparisons by ranking methods, Biom. Bull. 1 (6) (1945) 80–83.
[35] L. Zadeh, Outline of a new approach to the analysis of complex systems and decision processes, IEEE Trans. Syst. Man Cybern. 3 (1) (1973) 28–44.
[36] L. Zadeh, The concept of a linguistic variable and its application to approximate reasoning, Inf. Sci. 8 (1975) 199–249.
[37] M. Zeinalkhani, M. Eftekhari, Fuzzy partitioning of continuous attributes through discretization methods to construct fuzzy decision tree classifiers,
Inf. Sci. 278 (2014) 715–735.
