Knowledge discovery from data: InterCriteria Analysis of mutation rate influence

Abstract: In this paper the InterCriteria Analysis (ICrA) approach is applied to extract additional knowledge from series of identification procedures using 34 differently tuned genetic algorithms (GAs). The influence of the mutation rate $p_m$ on the algorithm performance is investigated. An E. coli fed-batch fermentation process model is used as a test problem. Based on the results from parameter identification, namely objective function values, the GAs producing the best results, with the corresponding $p_m$ values, are determined. Further, ICrA is applied using information from all model parameter estimates, computational time and objective function value. The ICrA confirms the conclusions based only on objective function values and helps to choose which mutation rate $p_m$ is more appropriate in the considered case study.


Introduction
InterCriteria Analysis (ICrA), proposed in [4], is a recently developed approach for evaluating multiple objects against multiple criteria and thereby discovering existing correlations between the criteria themselves. The ICrA approach has found various applications in science and practice: neural networks [19], properties of crude oils [20], e-learning [11], algorithm performance [8], ecology [10], etc.
The ICrA is applied to establish certain relations in the parameter identification of cultivation models using genetic algorithms (GAs). GA is a stochastic global optimization method. Among the many optimization techniques, GA is one of the methods based on biological evolution and inspired by Darwin's theory of survival of the fittest [9].
In [17] the investigation is particularly focused on the relations between the E. coli cultivation model parameters $\mu_{max}$, $k_S$ and $Y_{X/S}$, on the one hand, and the GA parameter ggap (generation gap), the convergence time and the model accuracy, on the other hand. The authors of [1] searched for relations between six S. cerevisiae cultivation model parameters, the GA parameter population size, and the GA outcomes: convergence time and objective function value. The results show that, based on ICrA, additional useful knowledge about identification procedures of cultivation process models can be derived and better performance of genetic algorithms can be achieved. Moreover, such research leads to a deeper understanding of the relations between the cultivation process model parameters.
Following these results, another important GA parameter, namely the mutation rate $p_m$, is investigated here. The choice of mutation rate is critical to the successful performance of genetic algorithms [14,15]. In this research the focus is on the influence of $p_m$ on the model accuracy (value of the optimization criterion $J$), the convergence time $T$ and the model parameter estimates. Thus, differently tuned GAs (with different $p_m$ values) can be compared and conclusions about the most appropriate mutation rate can be drawn.
An E. coli fed-batch fermentation model is considered as a case study. E. coli has been chosen as one of the important microorganisms with numerous applications in the food and pharmaceutical industries, as well as one of the most widely used model organisms in genetic engineering and cell biology, owing to its well-known metabolic pathways [13].
The paper is organized as follows: the problem formulation is given in Section 2, while Section 3 presents the background of ICrA. Numerical results and discussion are presented in Section 4 and concluding remarks are given in Section 5.

Investigation on mutation rate influence
The mathematical model of the E. coli fed-batch fermentation process is presented by a system of non-linear differential equations, Eqs. (1)-(3) [6], where $X$ is the biomass concentration, [g/l]; $S$ is the substrate concentration, [g/l]; $F_{in}$ is the feeding rate, [l/h]; $V$ is the bioreactor volume, [l]; $S_{in}$ is the substrate concentration in the feeding solution, [g/l]; $\mu_{max}$ is the maximum value of the specific growth rate $\mu$, [1/h]; $k_S$ is the saturation constant, [g/l]; $Y_{S/X}$ is the yield coefficient, [-].
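For reference, a Monod-type fed-batch model that is consistent with the symbols defined above can be written as follows. This is a sketch under the assumption that the specific growth rate follows Monod kinetics, $\mu = \mu_{max} S / (k_S + S)$, and that the yield enters as $1/Y_{S/X}$; it is not a verbatim copy of Eqs. (1)-(3) from [6].

```latex
\begin{align}
\frac{dX}{dt} &= \mu_{max}\,\frac{S}{k_S + S}\,X - \frac{F_{in}}{V}\,X, \\
\frac{dS}{dt} &= -\frac{1}{Y_{S/X}}\,\mu_{max}\,\frac{S}{k_S + S}\,X + \frac{F_{in}}{V}\,(S_{in} - S), \\
\frac{dV}{dt} &= F_{in}.
\end{align}
```

The first terms describe growth and substrate consumption, and the terms with $F_{in}/V$ describe dilution and substrate supply through feeding.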
The parameter vector $p = [\mu_{max}\ k_S\ Y_{S/X}]$ (see Eqs. (1)-(3)) should be identified using GAs with different values of the mutation rate.
Model parameter identification is performed based on real experimental data for biomass and glucose concentration. A detailed description of the process conditions and the experimental data is given in [18].
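To illustrate how model predictions for a candidate parameter vector $p$ could be obtained, the following minimal sketch integrates a Monod-type fed-batch model with a classical fourth-order Runge-Kutta scheme. The model form, the function names (`model_rhs`, `simulate_fed_batch`), the integration scheme and the step size are assumptions made for illustration only; they are not taken from the paper, and any numerical values passed to the functions are hypothetical.

```python
from typing import Callable, List, Tuple

State = Tuple[float, float, float]  # (X [g/l], S [g/l], V [l])


def model_rhs(state: State, f_in: float, mu_max: float, k_s: float,
              y_sx: float, s_in: float) -> State:
    """Right-hand side of the assumed Monod-type fed-batch model."""
    x, s, v = state
    mu = mu_max * s / (k_s + s)      # specific growth rate, 1/h
    dilution = f_in / v              # dilution caused by feeding, 1/h
    dx = mu * x - dilution * x
    ds = -mu * x / y_sx + dilution * (s_in - s)
    dv = f_in
    return (dx, ds, dv)


def simulate_fed_batch(params: Tuple[float, float, float], state0: State,
                       s_in: float, feed: Callable[[float], float],
                       t_end: float, dt: float = 0.01) -> List[Tuple[float, State]]:
    """Integrate the model from t = 0 to t_end with fixed-step RK4.

    params is the vector p = (mu_max, k_S, Y_S/X); feed(t) returns F_in at time t.
    Returns a list of (time, state) samples, including the initial state.
    """
    mu_max, k_s, y_sx = params

    def shifted(state, slope, h):
        # state + h * slope, component by component
        return tuple(a + h * b for a, b in zip(state, slope))

    state = state0
    trajectory = [(0.0, state0)]
    for i in range(round(t_end / dt)):
        t = i * dt
        k1 = model_rhs(state, feed(t), mu_max, k_s, y_sx, s_in)
        k2 = model_rhs(shifted(state, k1, dt / 2), feed(t + dt / 2), mu_max, k_s, y_sx, s_in)
        k3 = model_rhs(shifted(state, k2, dt / 2), feed(t + dt / 2), mu_max, k_s, y_sx, s_in)
        k4 = model_rhs(shifted(state, k3, dt), feed(t + dt), mu_max, k_s, y_sx, s_in)
        slope = tuple((a + 2 * b + 2 * c + d) / 6 for a, b, c, d in zip(k1, k2, k3, k4))
        state = shifted(state, slope, dt)
        trajectory.append((t + dt, state))
    return trajectory
```

In an identification procedure of this kind, a GA would call such a simulation once per candidate $p$ and compare the simulated $X$ and $S$ with the experimental data.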
The impact of $p_m$ is examined using a vector of 34 mutation rate values (expression (4)). The interval of $p_m$ values from 0.001 to 0.1 is chosen on the basis of reported results [7,12,16]. While the mutation rate is varied according to the vector $p_m$, all other parameters and operators are kept constant (see Table 1), and thus 34 differently tuned GAs are constructed. With the GAs so constructed, series of identification procedures of the mathematical model Eqs. (1)-(3) are performed with an objective function $J$, where $m$ and $n$ are the experimental data dimensions; $X_{exp}$ and $S_{exp}$ are the available experimental data for biomass and substrate; $X_{mod}$ and $S_{mod}$ are the model predictions for biomass and substrate with a given model parameter vector, $p = [\mu_{max}\ k_S\ Y_{S/X}]$.
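The set-up described above can be sketched as follows. Two assumptions are made here and should be read as such: first, that the 34 mutation rates are equally spaced between 0.001 and 0.1 (a step of 0.003, which is consistent with the $p_m$ values reported as best in this paper); second, that $J$ is a sum of squared deviations between experimental data and model predictions for biomass and substrate. The function names are hypothetical.

```python
from typing import List, Sequence


def mutation_rate_grid(start: float = 0.001, stop: float = 0.1,
                       count: int = 34) -> List[float]:
    """Equally spaced mutation rates (assumed spacing), rounded to 3 decimals."""
    step = (stop - start) / (count - 1)
    return [round(start + i * step, 3) for i in range(count)]


def objective(x_exp: Sequence[float], x_mod: Sequence[float],
              s_exp: Sequence[float], s_mod: Sequence[float]) -> float:
    """Assumed least-squares criterion J over biomass (m points) and substrate (n points)."""
    if len(x_exp) != len(x_mod) or len(s_exp) != len(s_mod):
        raise ValueError("experimental and model series must have equal lengths")
    j_biomass = sum((e - m) ** 2 for e, m in zip(x_exp, x_mod))
    j_substrate = sum((e - m) ** 2 for e, m in zip(s_exp, s_mod))
    return j_biomass + j_substrate
```

Each value in the grid would define one differently tuned GA, and each GA would minimize the same criterion so that the resulting $J$ values are directly comparable.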

InterCriteria analysis
Following [4] and [2], an Intuitionistic Fuzzy Pair (IFP), giving the degrees of "agreement" and "disagreement" between two criteria applied to different objects, will be obtained. As a reminder, an IFP is an ordered pair of real non-negative numbers $\langle a, b \rangle$, such that $a + b \le 1$.
For clarity, let an index matrix (IM) [3], whose index sets consist of the names of the criteria (for rows) and objects (for columns), be given. The elements of this IM are further supposed to be real numbers, which is not required in the general case. As a result, an IM is obtained whose index sets consist of the names of the criteria and whose elements are IFPs corresponding to the "agreement" and "disagreement" of the respective criteria.
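Let $O$ denote the set of all objects being evaluated, and let $C(O)$ be the set of values assigned by a given criterion $C$ (i.e., $C = C_p$ for some fixed $p$) to the objects, i.e.,
$$O := \{O_1, O_2, \ldots, O_n\}, \qquad C(O) := \{C(O_1), C(O_2), \ldots, C(O_n)\}.$$
Let $x_i = C(O_i)$. Then the following set can be defined:
$$C^{*}(O) := \{\langle x_i, x_j \rangle \mid i \neq j \ \&\ \langle x_i, x_j \rangle \in C(O) \times C(O)\}.$$
Further, if $x = C(O_i)$ and $y = C(O_j)$, $x \prec y$ will be written iff $i < j$.
The vectors of all internal comparisons for each criterion are constructed in order to find the agreement of different criteria. The elements of the vectors fulfil one of the three relations $R$, $\overline{R}$ and $\tilde{R}$:
$$\langle x, y \rangle \in R \Leftrightarrow \langle y, x \rangle \in \overline{R},$$
$$\langle x, y \rangle \in \tilde{R} \Leftrightarrow \langle x, y \rangle \notin (R \cup \overline{R}),$$
$$R \cup \overline{R} \cup \tilde{R} = C^{*}(O).$$
For example, if $R$ is the relation "<", then $\overline{R}$ is the relation ">", and vice versa. Hence, for the effective calculation of the vector of internal comparisons, denoted further by $V(C)$, only a subset of $C(O) \times C(O)$ needs to be considered, namely:
$$C^{\prec}(O) := \{\langle x, y \rangle \mid x \prec y \ \&\ \langle x, y \rangle \in C(O) \times C(O)\}.$$
Writing $c^{i,j} = \langle C(O_i), C(O_j) \rangle$ for brevity, the vector with lexicographically ordered pairs as elements is constructed for a fixed criterion $C$:
$$V(C) = \{c^{1,2}, c^{1,3}, \ldots, c^{1,n}, c^{2,3}, \ldots, c^{n-1,n}\}.$$
Then, the degree of "agreement" between two criteria that are to be compared is determined as the number of matching components, divided by the length of the vector for the purpose of normalization. This can be done in several ways, e.g. by counting the matches or by taking the complement of the Hamming distance. The degree of "disagreement" is the number of components of opposing signs in the two vectors, again normalized by the length. This may also be done in various ways.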
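The procedure described above can be sketched in a few lines. This is a minimal illustration, assuming that $R$ is the relation "<", that each internal comparison is encoded as +1, -1 or 0 (for $R$, $\overline{R}$ and $\tilde{R}$ respectively), and that ties on both sides count as a match; as noted in the text, other counting variants exist. The function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def comparison_vector(values: Sequence[float]) -> List[int]:
    """Sign-encoded internal comparisons over all pairs (i, j), i < j, in lexicographic order.

    +1 if x_i < x_j, -1 if x_i > x_j, 0 if the two values are equal.
    """
    def sign(d: float) -> int:
        return (d > 0) - (d < 0)

    return [sign(values[j] - values[i])
            for i, j in combinations(range(len(values)), 2)]


def icra_pair(c_k: Sequence[float], c_l: Sequence[float]) -> Tuple[float, float]:
    """Return (agreement, disagreement) for two criteria evaluated on the same objects."""
    if len(c_k) != len(c_l) or len(c_k) < 2:
        raise ValueError("both criteria need the same number (>= 2) of objects")
    v_k, v_l = comparison_vector(c_k), comparison_vector(c_l)
    length = len(v_k)                                  # n * (n - 1) / 2 comparisons
    matches = sum(1 for a, b in zip(v_k, v_l) if a == b)
    opposing = sum(1 for a, b in zip(v_k, v_l) if a * b == -1)
    return matches / length, opposing / length
```

For instance, two criteria that rank four objects identically give the pair (1.0, 0.0), and two criteria that rank them in exactly opposite order give (0.0, 1.0). The two degrees always sum to at most 1, so the result is an IFP.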

Numerical results and discussion

Parameter identification
The constructed 34 GAs are used to perform series of parameter identification procedures of the mathematical model Eqs. (1)-(3). The 34 IMs for all suggested $p_m$ values (see expression (4)) are constructed in this manner. Because of the limited space here, the full set of identification results is available at http://intercriteria.net/studies/gap/mutr/.
The obtained average values of the objective function for each GA, i.e. for each chosen value of $p_m$, are presented in Fig. 1. As can be seen, $p_m$ values lower than 0.028 and greater than 0.073 do not produce good results. Two local minima, 4.48 and 4.49, are selected as the best estimates. Nine results for these two local minima are observed, marked with red circles in Fig. 1. The corresponding mutation rates $p_m$ are listed in Table 2.
In order to draw more conclusions about the influence of the mutation rate on GA performance, the ICrA is further applied to the obtained results. Thus, the overall GA performance is evaluated based not only on the obtained $J$ values, but also on the total computational time $T$ and the three model parameter estimates.

ICrA of parameter identification results
The ICrA is performed to compare the performance of the 34 differently tuned GAs. All 34 IMs (IM (10), IM (11) and the remaining IMs in http://intercriteria.net/studies/mutr/e-coli/results.zip) are evaluated based on the ICrA algorithm. Thus, the degrees of "agreement" ($\mu_{C,C'}$ values) between the GA outcomes, namely computation time ($T$), objective function value ($J$) and model parameter estimates ($\mu_{max}$, $k_S$ and $Y_{S/X}$), are evaluated. The resulting IM is given in Exp. (12), where the criterion C1 is the objective value $J$, C2 is the computation time $T$, C3 is the model parameter $\mu_{max}$, C4 is the model parameter $k_S$ and C5 is the model parameter $Y_{S/X}$. ICrA is then applied again to the IM constructed in this way (Exp. (12)). The 34 different GAs are taken as criteria and the 10 pairs of criteria are taken as objects. As a result, 561 pairs of criteria are obtained. The intuitionistic fuzzy triangle representation is presented in Fig. 2. Fig. 3 shows all observed $\mu_{C,C'}$ and $\nu_{C,C'}$ values, sorted in descending order of the $\mu_{C,C'}$ values. Owing to the large number of criteria pairs (561), only some of them are printed in the x-axis legend.
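The two counts quoted above follow from simple combinatorics: five GA outcomes give 5·4/2 = 10 pairs, and 34 GAs give 34·33/2 = 561 pairs. The short check below enumerates both sets of pairs; the labels `GA1` to `GA34` are hypothetical names used only for illustration.

```python
from itertools import combinations
from math import comb

# The five GA outcomes used as criteria C1..C5 in the first ICrA step.
outcomes = ["J", "T", "mu_max", "k_S", "Y_S/X"]
outcome_pairs = list(combinations(outcomes, 2))   # objects of the second ICrA step

# Hypothetical labels for the 34 differently tuned GAs (criteria of the second step).
ga_labels = [f"GA{i}" for i in range(1, 35)]
ga_pairs = list(combinations(ga_labels, 2))

print(len(outcome_pairs), comb(5, 2))   # number of outcome pairs
print(len(ga_pairs), comb(34, 2))       # number of GA pairs
```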
The scheme proposed in [5] is used (see Table 3) for analyses of the observed degree of "agreement" (consonance) and degree of "disagreement" (dissonance) between each pair of criteria.
Based on the scale presented in Table 3, 342 criteria pairs are found to be in positive consonance: 3 pairs in strong positive consonance (SPC), 89 in positive consonance (PC) and 250 in weak positive consonance (WPC). The remaining 219 criteria pairs are in dissonance. There are no pairs in negative consonance. Only the pairs that are in strong positive consonance
[Figure axes: degree of "agreement", $\mu_{C,C'}$; degree of "disagreement", $\nu_{C,C'}$]

Conclusion
Differently tuned GAs are investigated in this work in order to determine the influence of the mutation rate $p_m$ on the algorithm performance. 34 different values of $p_m$ are used in the range between 0.001 and 0.1. The 34 GAs are applied to parameter identification of an E. coli fed-batch fermentation process model. Based on the obtained results, 9 GAs, with the following mutation rates $p_m$: 0.028, 0.034, 0.04, 0.055, 0.058, 0.067, 0.07 and 0.073, are identified as the best-performing algorithms. Further, by applying ICrA, more knowledge about the identification results of all 34 GAs is sought. The algorithm performance in this case is evaluated based on data for all parameter estimates, computational time and objective function value. The obtained results confirm the previous choice of best-performing GAs and give additional knowledge about the relations and correlations between the 34 investigated GAs.