Comparison of Backpropagation and Kohonen Self Organising Map (KSOM) Methods in Face Image Recognition

Background: The human face is a biometric feature. A branch of Artificial Intelligence (AI) called the Artificial Neural Network (ANN) can be used to recognise such a biometric feature. In ANN, the learning process is divided into two types: supervised and unsupervised learning. A common supervised learning method is Backpropagation, while a common unsupervised one is the Kohonen Self-Organising Map (KSOM). However, the application of Backpropagation and KSOM needs to be adjusted to improve performance. Objective: In this study, the Backpropagation and KSOM algorithms are rewritten to suit face image recognition, then applied and compared to determine the effectiveness of each algorithm in solving face image recognition. Methods: The methods used and compared in the case of face image recognition are the Backpropagation and Kohonen Self-Organising Map (KSOM) Artificial Neural Networks (ANN). Results: The smallest False Acceptance Rate (FAR) value of Backpropagation is 28%, and that of KSOM is 36%, out of 50 unregistered face images tested, while the smallest False Rejection Rate (FRR) value of Backpropagation is 22%, and that of KSOM is 30%, out of 50 registered face images. The fastest time for the training process using the Backpropagation method is 7.14 seconds, and the fastest recognition time is 0.71 seconds, while the fastest training time using the KSOM method is 5.35 seconds, and the fastest recognition time is 0.50 seconds. Conclusion: The Backpropagation method is better at recognising face images than the KSOM method, but the training and recognition processes of the KSOM method are faster than those of the Backpropagation method, because Backpropagation must also compute its hidden layers.


I. INTRODUCTION
Face images, fingerprints and voices are biometric features that can be used for authentication. With a computer-based biometric recognition system, digital images containing biometric features with certain characteristics are processed using Artificial Intelligence (AI) to identify a person. The Artificial Neural Network (ANN) is a method that represents the working system of the human brain by simulating the learning process. It can solve digital image processing problems in identification, classification, authentication, optimisation, diagnostics and approximation [1]. ANN is the best recognition method, compared to fuzzy and other hybrid techniques [2]. ANN can learn from experience, generalise from the examples it receives and abstract the characteristics of the input, even for irrelevant data. It is important to note that an ANN is not programmed to produce a specific output. All outputs or conclusions drawn by the network are based on its experiences during the learning process. In the learning process, the input and output patterns are presented to the ANN, and the network is taught to provide acceptable answers [3].
In ANN, the learning process methods are divided into two: supervised and unsupervised learning. Supervised learning takes place in artificial neural networks where the expected output value is known in advance. In the learning or training process, the resulting output patterns are compared with the target output patterns. If the difference is too large, learning is carried out again until it reaches the minimum tolerable error value [4]. Unsupervised learning takes place in artificial neural networks that do not require an output target. The purpose of this learning is to group units that are almost the same, or to classify patterns [5].
Face image recognition is a problem that is often encountered in previous studies, whether using the Backpropagation or the KSOM method. Therefore, in this research, the Backpropagation and KSOM algorithms were rewritten with sequential steps adapted to the application of face image recognition. The Backpropagation method was chosen because it is one of the most popular supervised learning ANN methods, with excellent performance. In previous research, the success rate of face image recognition using the Backpropagation method was 98% [6]. Other research concluded that the Backpropagation method with an improvised Scaled Conjugate Gradient (SCG) generates an average recognition rate of 93% and an average of 815 iterations [7]. The KSOM method was chosen because it is one of the most widely used unsupervised learning ANN methods, with a high accuracy value and a short recognition time. In previous studies, the KSOM method has been able to recognise face images with an accuracy value of 98% [8] within 0.008-0.03 sec [9].
The problem in this research is how to apply the steps of the Backpropagation and KSOM algorithms to face image recognition, and how to analyse the performance of each method measured by FAR, FRR and the success rate in recognising face images. The first novelty of this research compared to other studies is that the performance of the Backpropagation and KSOM methods is measured by the success rate based on the level of face tilt (turning face), face expressions and accessories worn, and by the time needed to carry out the learning and recognition processes. The second novelty is that the face image database used is a combination of the database from The Olivetti Research Laboratory (ORL) and face images taken from a web camera. Thirty face image identities were obtained from the ORL dataset, and 20 face image identities were taken using a web camera, giving a total of 50 face image identities with 10 face images each, i.e. 500 face images in total.

II. LITERATURE REVIEW
A. Digital Image
A digital image is a matrix of rows and columns, where the value at each row and column index represents a point in the image. A true colour image is an image representation that has three main component values, namely red, green and blue (RGB). Each component in a true colour image has 256 possible values, so a true colour image has a total of 16,777,216 possible colours. Grayscale images are also called 8-bit images because they have 2⁸ (256) possible values for each pixel; the values start at zero for black and go up to 255 for white. The image that will be used and processed by the system is a grayscale image. Therefore, a true colour image must be converted into a grayscale image, which takes the R, G and B values of the true colour image as input.

B. Artificial Neural Network (ANN)
The Artificial Neural Network (ANN) is a computer science method that represents the working system of the human brain by simulating the learning process. ANN can solve digital image processing problems in identification, classification, authentication, optimisation, diagnostics and approximation [1]. ANN is the best recognition method, compared to fuzzy and other hybrid techniques [2]. ANN can learn from experience, generalise from the examples it receives and abstract the characteristics of the input, even for irrelevant data. An ANN is not programmed to produce a specific output. All outputs or conclusions drawn by the network are based on its experiences during the learning process. In the learning process, input (and output) patterns are presented to the ANN, and the network is taught to provide acceptable answers.

C. Supervised Learning
Supervised learning is a learning method in artificial neural networks where the expected output value is known in advance. In the learning or training process, the resulting output patterns are compared with the target output patterns. If the difference is too large, learning is carried out again until it reaches the minimum tolerable error value [5].

D. Unsupervised Learning
Unsupervised learning is a learning method in artificial neural networks that does not require an output target. The purpose of this learning is to group units that are almost the same, or to classify patterns [6].

III. METHODS

A. Convert True Coloured Image to Grayscale
The conversion of a true colour image to a grayscale image changes the pixel value, which originally had three values (R, G, B), into a single gray value. Equation (1) is used to get the gray value [10]:

$$I_{gray}(i) = w_R\,R(i) + w_G\,G(i) + w_B\,B(i) \qquad (1)$$

where $I_{gray}(i)$ is the gray value at the $i$-th pixel, $w_R$, $w_G$ and $w_B$ are the weights for the red, green and blue colour elements, and $R(i)$, $G(i)$ and $B(i)$ are the intensity values of the red, green and blue colour elements.
The NTSC (National Television System Committee) defines the weights for converting true colour images to grayscale as follows: $w_R = 0.299$, $w_G = 0.587$, $w_B = 0.114$. The input data is a true colour image and the output data is a grayscale image.
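As a brief illustration of equation (1), the following NumPy sketch applies the NTSC weights channel-wise; the function name and array layout (H × W × 3 in RGB order) are our own assumptions, not part of the original system.

```python
import numpy as np

# NTSC weights for converting a true colour (RGB) image to grayscale,
# as in equation (1): gray = 0.299*R + 0.587*G + 0.114*B
W_R, W_G, W_B = 0.299, 0.587, 0.114

def rgb_to_grayscale(rgb):
    """Convert an H x W x 3 uint8 RGB image to an H x W uint8 grayscale image."""
    rgb = rgb.astype(np.float64)
    gray = W_R * rgb[..., 0] + W_G * rgb[..., 1] + W_B * rgb[..., 2]
    return np.clip(np.round(gray), 0, 255).astype(np.uint8)
```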

B. Image Size Normalisation
Image normalisation is performed to make the size of the face images to be processed uniform. The size used is 92×112 pixels, following the standard of 'The ORL Database of Faces', AT&T Laboratories Cambridge. The normalisation process is carried out using the bilinear interpolation method [11]. A digital image with dimensions of $M \times N$ pixels is defined as a two-dimensional matrix, and each interpolated value is computed from the values of the points around the point to be interpolated. In the normalisation process, the input image is a Bitmap image of any size and the output image is a Bitmap image with a size of 92×112 pixels.
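A minimal sketch of this step, assuming Pillow is used for the resize; the paper does not name an implementation library.

```python
from PIL import Image

# Normalise a face image to the 92x112-pixel ORL standard using
# bilinear interpolation (Image.BILINEAR is Pillow's bilinear filter).
def normalise_face(path, size=(92, 112)):
    img = Image.open(path)
    return img.resize(size, Image.BILINEAR)
```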

C. Binary and Bipolar Sigmoid Activation Functions
The results of the initialisation process are converted into the range [-1, 1] using the bipolar sigmoid function, to make the training process with the Backpropagation algorithm easier and faster, as shown in Fig. 1 [12] [13]. In training with the KSOM algorithm, on the other hand, the binary sigmoid function is used, which has a value range of 0 to 1. Fig. 2 shows the binary sigmoid activation function [12] [14].
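Since Figs. 1 and 2 show only the curves, the standard closed forms of the two activation functions are sketched below; the derivative is included because the Backpropagation pass in Section G needs it. This is an illustration of the standard formulas, not code from the paper.

```python
import numpy as np

def binary_sigmoid(x):
    """Binary sigmoid (logsig): output in (0, 1), used here with KSOM."""
    return 1.0 / (1.0 + np.exp(-x))

def bipolar_sigmoid(x):
    """Bipolar sigmoid: output in (-1, 1), used here with Backpropagation."""
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def bipolar_sigmoid_deriv(x):
    """Derivative of the bipolar sigmoid, needed in the backpropagation pass."""
    f = bipolar_sigmoid(x)
    return 0.5 * (1.0 + f) * (1.0 - f)
```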
D. Discrete Cosine Transform (DCT)
The Discrete Cosine Transform (DCT) is an image transformation based on the cosine function. The DCT coefficients are real, orthogonal and separable, and the transform has an efficient computation process. In this system, the two-dimensional (2D) DCT is used to extract features, i.e. to obtain certain characteristics of face images, and also to reduce the dimensions of the face images to speed up the training process [15] [16].
The coefficients are taken from the upper-left corner because the significant DCT coefficients are concentrated there; the further from that corner, the smaller the values become, and they can be ignored. In accordance with Backpropagation theory, training is faster if the input values lie within the range of the activation function. Since the activation function used is the bipolar sigmoid function, the input layer values are converted into the range [-1, 1]. The conversion process can be formulated as:

$$x_{ij} = \frac{2\,(c_{ij} - c_{min})}{c_{max} - c_{min}} - 1$$

where $i = 0, 1, \dots$ indexes the DCT coefficients, $j = 0, 1, \dots$ indexes the data sets, $x_{ij}$ is the input layer value, $c_{ij}$ is the DCT coefficient, $c_{min}$ is the lowest DCT coefficient value, and $c_{max}$ is the highest DCT coefficient value.
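A hedged sketch of this feature-extraction step using SciPy's 2D DCT. The 59-coefficient count follows Section IV.B, but the top-left block selection (rather than a strict zig-zag scan) and the per-image min-max scaling are our simplifying assumptions.

```python
import numpy as np
from scipy.fft import dctn

def dct_features(gray, n_coeffs=59):
    """Extract low-frequency 2D-DCT coefficients from a grayscale face image
    and scale them to [-1, 1] for the bipolar sigmoid input layer."""
    coeffs = dctn(gray.astype(np.float64), norm='ortho')
    # Keep a small top-left block, where the significant coefficients lie.
    k = int(np.ceil(np.sqrt(n_coeffs)))
    feats = coeffs[:k, :k].flatten()[:n_coeffs]
    # Min-max scale into the bipolar sigmoid range [-1, 1].
    c_min, c_max = feats.min(), feats.max()
    return 2.0 * (feats - c_min) / (c_max - c_min) - 1.0
```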

E. Viola-Jones Face Detection Algorithm
The Viola-Jones algorithm is the most widely applied method for detecting face images, with fast, accurate and efficient performance. The main processes of Viola-Jones are:

1. Haar-like feature selection
The image to be processed is classified based on feature values, to separate out the images that are not needed. In this process, the background in the image is not counted. There are three types of features, distinguished by the number of rectangles they contain: two, three or four rectangles.

2. Creating integral image
An integral image is a data structure and method for summing subsets of values in the image matrix.
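To make the idea concrete, here is a minimal NumPy sketch (our own illustration, not from the paper): the integral image is the 2D cumulative sum of the image, after which the sum of any rectangle can be read off in at most four lookups.

```python
import numpy as np

def integral_image(img):
    """2D cumulative sum: ii[y, x] = sum of img[:y+1, :x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] using four lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```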

3. AdaBoost training
The AdaBoost algorithm works to find features that have a high level of discrimination. This is done by evaluating each feature on face and non-face regions and selecting those considered the best features.

4. Cascading classifier
The characteristic feature of the Viola-Jones method is the cascading classifier. Classification in this method consists of several levels, and each level discards sub-images that are believed not to be part of a face. This is done because it is easier to reject sub-images that are not part of a face than to confirm those that are [17] [18].

F. Kohonen Self-Organising Map (KSOM) Method
The Kohonen Self-Organising Map (KSOM), also known as the Kohonen map/network, is an unsupervised learning method, so this network structure does not require an output target. It modifies the weights of an ANN without needing a predetermined output for a specific input pattern. The advantage is that this enables the network to find its own solutions, making it more effective and efficient for pattern associations. The main disadvantage is that the output must be interpreted correctly [2] [19].
The activation function used in KSOM is the binary sigmoid (logsig), so before entering the learning process, the input data must be transformed so that its values lie in the range 0 to 1. All face images are trained according to the algorithm of this method to update the weights, which become the final weights used for the recognition process.
The initial values needed to enter the KSOM algorithm are the maximum epoch, the learning rate/alpha (α), the training data matrix, a random initial weight matrix and the alpha reduction (δ). The KSOM algorithm is described as follows [14] [20]:
- Initialisation: set the weights $w_{ij}$ to random initial values.
- While the stop condition is FALSE:
1) For every unit $j$, calculate the distance using (13):
$$D(j) = \sum_i (w_{ij} - x_i)^2 \qquad (13)$$
2) Determine the index $J$ for which $D(J)$ is the smallest/minimum value.
3) For the winning unit $J$ and each input $i$, update the weights using (14):
$$w_{iJ}(\text{new}) = w_{iJ}(\text{old}) + \alpha\,(x_i - w_{iJ}(\text{old})) \qquad (14)$$
4) Update the learning rate using (15):
$$\alpha(t+1) = \delta \cdot \alpha(t) \qquad (15)$$
- Repeat until the maximum epoch value is reached or the stop condition test succeeds.
Equation (16) is used to calculate the Euclidean distance in the matching or recognition process:
$$D_j = \sqrt{\sum_i (w_{ij} - x_i)^2} \qquad (16)$$
where $D_j$ is the Euclidean distance, $w_{ij}$ is the weight of neuron $j$ (final weight), and $x_i$ is the input vector. The minimum Euclidean distance identifies the stored face image that best matches. The threshold is used to limit the Euclidean distance in the matching/recognition process, while the similarity distance serves to limit the iteration of weight changes so that the best weights are obtained, even before the maximum epoch has been reached.
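A minimal KSOM training sketch under the assumptions above (winner-take-all updates with no topological neighbourhood, multiplicative alpha reduction); the function and variable names are ours, and the stop conditions (maximum epoch or similarity distance) follow the text.

```python
import numpy as np

def train_ksom(X, n_units, alpha=0.6, alpha_reduction=0.5,
               max_epoch=100, similarity_distance=1e-15, seed=0):
    """Winner-take-all KSOM training following equations (13)-(15).
    X: (n_samples, n_features) matrix scaled to [0, 1].
    n_units: number of output units (e.g. one per registered identity)."""
    rng = np.random.default_rng(seed)
    W = rng.random((n_units, X.shape[1]))       # random initial weights
    for epoch in range(max_epoch):
        W_old = W.copy()
        for x in X:
            d = ((W - x) ** 2).sum(axis=1)      # eq. (13): distance to each unit
            j = d.argmin()                      # winning unit J
            W[j] += alpha * (x - W[j])          # eq. (14): update winner's weights
        alpha *= alpha_reduction                # eq. (15): reduce learning rate
        # Stop early once the weights barely change (similarity distance).
        if np.abs(W - W_old).max() < similarity_distance:
            break
    return W

def recognise(W, x):
    """Eq. (16): index and value of the minimum Euclidean distance."""
    d = np.sqrt(((W - x) ** 2).sum(axis=1))
    return d.argmin(), d.min()
```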

G. Backpropagation Method
One of the most widely used artificial neural network training algorithms in the field of pattern recognition is Backpropagation. This algorithm is generally used in multi-layer feed-forward neural networks, which are composed of several layers in which signals flow in one direction, from input to output [21] [22]. The Backpropagation training algorithm consists of three basic stages:
a. Feed the training data forward to obtain the output values
b. Backpropagate the resulting error values
c. Adjust the connection weights to minimise the error value
These three stages are repeated until the desired error value is achieved. After training is complete, only the first stage is needed to use the neural network. Error information is propagated backwards, starting from the output layer and ending at the input layer, which is why the algorithm is named Backpropagation [13] [23] [24].

In training artificial neural networks using the Backpropagation algorithm, the steps are as follows:
a. Initialise the weights with random values between -0.5 and 0.5.
b. Determine the learning rate (α).
c. Specify the error tolerance or threshold value (when using the threshold value as the stop condition) or the maximum number of epochs (when using the number of epochs as the stop condition).
d. Perform the following steps as long as the stop condition has not been met (value FALSE):
1) For each pair of training patterns, do the following.
a) Feedforward
(1) Each input unit (from the 1st unit to the n-th unit in the input layer, $i = 1, \dots, n$) sends its input signal $x_i$ to all units in the layer above it (the hidden layer).
(2) Each unit in the hidden layer (from the 1st unit to the p-th unit, $j = 1, \dots, p$) calculates its output signal by applying the activation function to the weighted sum of its input signals. In this study, the bipolar sigmoid activation function $f$ is used, as in (17):
$$z_j = f\Big(v_{0j} + \sum_{i=1}^{n} x_i\,v_{ij}\Big) \qquad (17)$$
The result is then sent to all units in the layer above.
(3) Each unit in the output layer (from the 1st unit to the m-th unit, $k = 1, \dots, m$) calculates its activation by applying the function to the $z_j$-weighted sum of the input signals for this layer, as in (18):
$$y_k = f\Big(w_{0k} + \sum_{j=1}^{p} z_j\,w_{jk}\Big) \qquad (18)$$

b) Backpropagation
(1) Each output unit (from the 1st unit to the m-th unit, $k = 1, \dots, m$) receives a target pattern $t_k$, and the output layer error information $\delta_k$ is computed by (19):
$$\delta_k = (t_k - y_k)\,f'(y\_in_k) \qquad (19)$$
$\delta_k$ is sent to the layer below it and is used to calculate the weight and bias corrections ($\Delta W_{jk}$ and $\Delta W_{0k}$) between the hidden layer and the output layer, as shown in (20) and (21):
$$\Delta W_{jk} = \alpha\,\delta_k\,z_j \qquad (20)$$
$$\Delta W_{0k} = \alpha\,\delta_k \qquad (21)$$
(2) For each unit in the hidden layer (from the 1st unit to the p-th unit, $j = 1, \dots, p$), the hidden layer error information $\delta_j$ is calculated by (22):
$$\delta_j = f'(z\_in_j) \sum_{k=1}^{m} \delta_k\,W_{jk} \qquad (22)$$
$\delta_j$ is then used to calculate the weight and bias corrections ($\Delta V_{ij}$ and $\Delta V_{0j}$) between the input layer and the hidden layer, as shown in (23) and (24):
$$\Delta V_{ij} = \alpha\,\delta_j\,x_i \qquad (23)$$
$$\Delta V_{0j} = \alpha\,\delta_j \qquad (24)$$

c) Weights and Bias Updates
(1) For each output unit $y_k$ (from the 1st unit to the m-th unit), the bias and weights ($j = 0, \dots, p$; $k = 1, \dots, m$) are corrected so that the new bias and weights become (25):
$$W_{jk}(\text{new}) = W_{jk}(\text{old}) + \Delta W_{jk} \qquad (25)$$
From the 1st unit to the p-th unit in the hidden layer, the bias and weights ($i = 0, \dots, n$; $j = 1, \dots, p$) are also updated, by (26):
$$V_{ij}(\text{new}) = V_{ij}(\text{old}) + \Delta V_{ij} \qquad (26)$$
2) Calculate the Mean Squared Error (MSE), the mean of the squared differences between the target outputs and the actual output values, with (27):
$$MSE = \frac{1}{m} \sum_{k=1}^{m} (t_k - y_k)^2 \qquad (27)$$
3) Stop condition test: the stop condition is TRUE if the MSE value is less than or equal to the error tolerance value, or if the number of epochs has reached the maximum epoch.
In the matching/recognition process, the feedforward process is performed using the final weights.
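A compact single-hidden-layer sketch of the training loop in equations (17)-(27), using the bipolar sigmoid. The vectorised form and hyperparameter defaults are our assumptions for illustration, and the momentum term (μ) used in Section IV is omitted for brevity.

```python
import numpy as np

def f(x):                      # bipolar sigmoid, used in eqs. (17)-(18)
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def f_prime(x):                # its derivative, used in eqs. (19) and (22)
    fx = f(x)
    return 0.5 * (1.0 + fx) * (1.0 - fx)

def train_backprop(X, T, n_hidden=30, alpha=0.008,
                   error_tolerance=0.01, max_epoch=1000, seed=0):
    """X: (n_samples, n_in) inputs in [-1, 1]; T: (n_samples, n_out) targets."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    V = rng.uniform(-0.5, 0.5, (n_in + 1, n_hidden))   # input->hidden (row 0 = bias)
    W = rng.uniform(-0.5, 0.5, (n_hidden + 1, n_out))  # hidden->output (row 0 = bias)
    for epoch in range(max_epoch):
        sq_err = 0.0
        for x, t in zip(X, T):
            # Feedforward, eqs. (17)-(18)
            z_in = V[0] + x @ V[1:]
            z = f(z_in)
            y_in = W[0] + z @ W[1:]
            y = f(y_in)
            # Backpropagation of error, eqs. (19)-(24)
            dk = (t - y) * f_prime(y_in)
            dj = (dk @ W[1:].T) * f_prime(z_in)
            # Weight and bias updates, eqs. (25)-(26)
            W[1:] += alpha * np.outer(z, dk)
            W[0] += alpha * dk
            V[1:] += alpha * np.outer(x, dj)
            V[0] += alpha * dj
            sq_err += ((t - y) ** 2).sum()
        mse = sq_err / (len(X) * n_out)               # eq. (27)
        if mse <= error_tolerance:                    # stop condition test
            break
    return V, W
```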

IV. RESULTS
The process of testing the Kohonen Self-Organising Map (KSOM) neural network used the following training variables: training rate (α) = 0.6; alpha reduction (δ) = 0.5; threshold = 0.02; and similarity distance = 0.000000000000001 ($10^{-15}$). The threshold limits the Euclidean distance in the matching/recognition process, while the similarity distance limits the iteration of weight changes so that the best weights are obtained, even before the maximum epoch has been reached.
A. Flowchart
Fig. 3 illustrates the flowchart of the training process and the face image recognition system.

B. Data for Training and Testing
In the training process, 50 identities were used with 10 face images each, so a total of 500 face images were used as training data. In the testing process, the face images used differed from the face images stored during training. To calculate FAR, 50 face images whose identities were not stored in the database (all with different identities) were used as testing data. To calculate FRR, 50 different face images whose identities were stored in the database were used.
The input image used as training data is a face image, possibly including a background. The captured image was processed by face detection using the Viola-Jones algorithm. Cropping was performed to extract the face image by removing the background. The true colour image was then converted to a grayscale image, and the grayscale image was normalised to a size of 92×112 pixels. Feature extraction was carried out to obtain the image characteristics (59 values), which were used as input data in the input layer. The image was then ready to be trained using the Backpropagation or KSOM algorithm. After the training process, testing was carried out.
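The preprocessing chain just described can be sketched with OpenCV's stock Haar cascade as a stand-in for the paper's Viola-Jones detector; the cascade file, the crop of the first detection, and chaining into the hypothetical `dct_features` helper from Section III.D are our assumptions.

```python
import cv2

# Stock frontal-face Haar cascade shipped with OpenCV (a Viola-Jones detector).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def preprocess(path):
    """Detect, crop, grayscale and normalise a face image to 92x112 pixels."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    x, y, w, h = faces[0]                      # take the first detected face
    crop = gray[y:y + h, x:x + w]              # remove the background
    # Bilinear interpolation to the ORL standard size (width, height).
    return cv2.resize(crop, (92, 112), interpolation=cv2.INTER_LINEAR)
```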
C. Face Image Database
The face image database used contains 500 face images, consisting of 50 identities with 10 face images for each identity. Of the 50 identities, 30 came from the ORL face dataset [11], and 20 were taken using a web camera. The 10 face images per identity were taken with several tilt angles and face expressions. These sample training/learning and testing images had been cropped and converted to grayscale. Fig. 4 shows the images from the ORL database, and Fig. 5 shows sample images taken from a web camera. The web camera face images were taken in a closed room with an area of 12 m² and a height of 4 m, lit by a 6-Watt LED Cool Daylight lamp with a brightness level of 470 lumens. The distance between the camera and the face was 40-50 cm. Because the distance and lighting levels were constrained, the system can only recognise skin colour consistent with the training images stored in the database. Table 1 is an example of the matrix used in the training process, adjusted for the number of input layer units and the number of registered face image identities. The database used in this system is simple: the tables used are the training table and the identity table.

D. Definition of FAR and FRR
In the case of matching or recognition, the accuracy of the system is measured by the Accuracy Rate, the False Acceptance Rate (FAR) and the False Rejection Rate (FRR). A False Acceptance (FA) occurs when the system accepts a face image whose identity is not recorded in the database. A False Rejection (FR) occurs when the system rejects a face image whose identity is stored in the database. The success rate is the percentage of the system's success in recognising the right face, namely accepting face images stored in the database, or rejecting face images that are not stored in the database. The Accuracy Rate, FAR and FRR are calculated using (28), (29) and (30) [25] [26] [27]:

$$Accuracy\ Rate = \frac{\text{correctly recognised images}}{\text{images tested}} \times 100\% \qquad (28)$$
$$FAR = \frac{\text{number of FA}}{\text{unregistered images tested}} \times 100\% \qquad (29)$$
$$FRR = \frac{\text{number of FR}}{\text{registered images tested}} \times 100\% \qquad (30)$$

Therefore, the lower the FAR or FRR, the higher the success rate.
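A small sketch of how these rates would be computed from test outcomes; the boolean-list representation is our own illustration, not part of the paper's system.

```python
def far_frr(accepted_unregistered, rejected_registered):
    """accepted_unregistered: bools, True if an unregistered image was accepted.
    rejected_registered: bools, True if a registered image was rejected."""
    far = 100.0 * sum(accepted_unregistered) / len(accepted_unregistered)
    frr = 100.0 * sum(rejected_registered) / len(rejected_registered)
    return far, frr

# Example: 14 of 50 unregistered images accepted (FAR 28%),
# 11 of 50 registered images rejected (FRR 22%).
far, frr = far_frr([True] * 14 + [False] * 36, [True] * 11 + [False] * 39)
print(f"FAR = {far:.0f}%, FRR = {frr:.0f}%")
```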

E. FAR and FRR of Backpropagation based on the Number of Hidden Layers
In the testing process to determine the best hidden layer for Backpropagation, the following training variables were used: input layer = 35, training rate (α) = 0.008, momentum (μ) = 0.02, error tolerance = 0.01. The face images used follow the training and testing data above. Table 2 shows a comparison of the FAR and FRR values of the face image recognition system using the Backpropagation method, based on the number of hidden layers. To calculate FAR, 50 face images whose identities were not stored in the database were used; to calculate FRR, 50 different face images whose identities were stored in the database were used. From the testing results, it can be concluded that the best number of hidden layers in the face image recognition system using the Backpropagation method is 30.

F. Comparison of the Backpropagation and KSOM
In the testing using the KSOM method, the following training variables were used: training rate (α) = 0.6; alpha reduction (δ) = 0.5; threshold = 0.02; and similarity distance = 0.000000000000001 ($10^{-15}$). Table 3 shows a comparison of the FAR and FRR of the face image recognition process using the Backpropagation and KSOM methods.

In the first experiment, to calculate FAR, 50 face images whose identities were not stored in the database were used; to calculate FRR, 50 different face images whose identities were stored in the database were used. From the test results, the FAR of Backpropagation is 28% and the FAR of KSOM is 36%, while the FRR of Backpropagation is 22% and the FRR of KSOM is 30%. The average success rate of Backpropagation is 75%, and the average success rate of KSOM is 67%. Because the FAR and FRR of Backpropagation are lower than those of KSOM, and the success rate of Backpropagation is higher, it can be concluded that the Backpropagation method is better at recognising face images than the KSOM method.

In the second experiment, a recognition test was conducted based on the level of face tilt: tilting right, left, up and down. In this experiment, 20 face images whose identities were stored in the database were used. The result of the second experiment is shown in Table 4. From the experiment results written in Table 5, the Backpropagation method has a higher success rate based on the level of face tilt than the KSOM method. Face images tilted to the right or left are easier to identify than those tilted up or down. A tilt of 15° to the left or right is the most easily recognised face tilt.

In the third experiment, a recognition test was conducted based on face expressions and the accessories worn. In this experiment, 20 face images whose identities were stored in the database were used. Based on face expressions, the average success rate of the Backpropagation method is 78%, while that of the KSOM method is 70%. The Backpropagation method can recognise faces wearing a hat with a success rate of only 20%, while the KSOM method achieves 10%. The Backpropagation method can recognise faces wearing a pair of glasses with a success rate of only 15%, while the KSOM method achieves 10%. Neither the Backpropagation nor the KSOM method can recognise faces wearing a mask. Therefore, it can be concluded that the success rate of the Backpropagation method is higher than that of the KSOM method based on face expressions and the wearing of a hat or glasses.

In the fourth experiment, a recognition test was conducted based on the time of the training process and the recognition process. The training process was carried out by processing 500 face images from the 50 identities stored in the database, and the recognition process was carried out by recognising one face image. From Table 6, the training time using the Backpropagation method is 7.14 seconds, and the recognition time is 0.57 seconds, while the training time using the KSOM method is 5.49 seconds, and the recognition time is 0.50 seconds. Therefore, it can be concluded that the training and recognition processes of the KSOM method are faster than those of the Backpropagation method. This is because the Backpropagation method has hidden layers that demand computation in the feedforward pass.

V. DISCUSSION
Firstly, the face image is detected using the Viola-Jones algorithm, and a cropping process is then applied to the face image [17]. After that, the true colour image is converted to a grayscale image [10]. The grayscale face image is then normalised, so that all face images have the same size. Feature extraction on the face image is carried out using the Discrete Cosine Transform (DCT) method [16]. The matrix resulting from the feature extraction process becomes the input data for the training process. The Backpropagation or KSOM method is used for the training process, with predetermined variables. The variables used in the KSOM method are: training rate (α), alpha reduction (δ), threshold, similarity distance and maximum epoch [20], while the variables used in the Backpropagation method are: training rate (α), momentum (μ), error tolerance, number of hidden layers and maximum epoch [22]. After the training process, the recognition process can be carried out. If the training process uses the KSOM method, then the matching in the recognition process uses the Euclidean distance. If the training process uses the Backpropagation method, then the matching in the recognition process uses a feedforward process.
It is important to note that, instead of the binary sigmoid, this research uses Backpropagation with a bipolar sigmoid function, because the results of using a bipolar sigmoid function are more precise and suitable for this application. The input layer values for the bipolar sigmoid function are converted into the range [-1, 1] [13]. Secondly, in the KSOM method, the recommended activation function is the binary sigmoid, so the input layer values are converted into the range [0, 1] [14]. Thirdly, in the application of the KSOM method, determining the threshold value also greatly influences the matching/recognition process, because the threshold value is used as the limit of the Euclidean distance in determining the identity of the image [9]. Determining the minimum value of the similarity distance affects the number of iterations, or the maximum epoch, in determining the weights, so the iteration stops when the similarity distance is reached even though the maximum epoch value has not been fulfilled [19]. Fourthly, the limitations of taking face images using a camera are as follows: the face image is taken from a web camera, in a closed room with an area of 12 m² and a height of 4 m, lit by a 6-Watt LED Cool Daylight lamp with a brightness level of 470 lumens, and the distance between the camera and the face is 40-50 cm. Because the distance and lighting levels are constrained, the system can only recognise skin colour consistent with the training images stored in the database. If, in the testing process, the image to be tested shows a person wearing glasses, then one image with glasses must be included in the training image database. The same applies to face expressions and levels of face tilt: the training image database must contain face images with several face expressions and tilt levels [7].

VI. CONCLUSIONS
Based on the FAR and FRR values of the hidden layer comparison using the Backpropagation method, the best number of hidden layers is 30. Based on the FAR and FRR values of the face image recognition process, the Backpropagation method is better than the KSOM method. Based on the level of face tilt to the right, left, up and down, the Backpropagation method shows a higher success rate than the KSOM method. Face images tilted to the right or left are easier to identify than those tilted up or down; a tilt of 15° to the left or right is the most easily recognised face tilt. Based on face expression and the use of accessories, the success rate of the Backpropagation method is higher than that of the KSOM method; however, neither the Backpropagation nor the KSOM method can recognise faces with a mask. Based on the times of the training process and the recognition process, the KSOM method is faster than the Backpropagation method.

Funding: This research received no specific grant from any funding agency.