An Efficient CNN Model for Automated Digital Handwritten Digit Classification

Background: Handwriting recognition has become an appreciable research area because of its important practical applications, but the variety of writing patterns makes automatic classification a challenging task. Classifying handwritten digits with higher accuracy is needed to overcome the limitations of past research, which mostly used deep learning approaches. Objective: The two most noteworthy limitations are low accuracy and slow computational speed. The current study models a Convolutional Neural Network (CNN) that is simple yet more accurate in classifying English handwritten digits across different datasets. The novelty of this paper is to explore an efficient CNN architecture that can classify digits of different datasets accurately. Methods: The author proposed five different CNN architectures for training and validation tasks with two datasets. Dataset-1 consists of 12,000 MNIST data and Dataset-2 consists of 29,400 digit data from Kaggle. The proposed CNN models extract the features first and then perform the classification tasks. For performance optimization, the models utilized the stochastic gradient descent with momentum optimizer. Results: Among the five models, one was found to be the best performer, with 99.53% and 98.93% validation accuracy for Dataset-1 and Dataset-2, respectively. Compared to the Adam and RMSProp optimizers, stochastic gradient descent with momentum yielded the highest accuracy. Conclusion: The proposed best CNN model has the simplest architecture. It provides higher accuracy for different datasets and takes less computational time. The validation accuracy of the proposed model is also higher than those reported in past works.


INTRODUCTION
Handwriting plays an important role in everyday life, not only as a medium of communication but also for legal documentation. Each person has inherent writing patterns, and this diversity often makes handwriting difficult to recognize and read [1]. Different computer-based recognition systems are thus applied to identify handwritten information correctly. Several deep learning algorithms have already been applied in the handwriting recognition research area. They perform better than conventional methods and take less time to recognize, especially when a large dataset is involved. Significant practical applications of this research include signature verification on bank checks, vehicle plate detection, digit classification from images, information extraction from historical documents [2] and so on. The recognition process uses handwritten images as a dataset: first, features are extracted from the images, and then classification is performed. Characters, numbers and cursive texts are the main inputs to this recognition process [3]. Applying automatic classification techniques in this area is challenging due to the high variety of individual writing patterns.
The recognition or classification process operates on text images and is thus known as optical character recognition [4]. It plays a crucial role in many applications, especially commercial ones, and various datasets are used for recognition in this research paradigm [4]. Different authors have used different methods to classify various digits and characters. Authors in [5] used convolutional neural networks to recognize Kannada, a South Indian script; 497 classes were included in the experiment dataset, AlexNet was used for training, and the accuracy obtained on handwritten text was 92%. Another CNN model was used for Devanagari numeral recognition by authors in [6], whose proposed model combined a Genetic Algorithm (GA) with a CNN; data pre-processing and feature extraction were applied before training, and the obtained accuracy was 96.06%. Authors in [7] used a GA and support vector machines to classify Bangla numeric digits: features were extracted using the Genetic Algorithm and the dataset was trained with a support vector machine (SVM) classifier, yielding 97.70% accuracy on a total of 6,000 data. A combination of a Restricted Boltzmann Machine (RBM) and a CNN was implemented by the author in [8] for handwritten Arabic digit recognition. Two phases were proposed: first, meaningful features were extracted from the dataset, and then the Arabic digits were classified by the CNN classifier. The accuracy reached 98.59%.
Another DNN-based Bangla handwritten digit classification work was done by authors in [9]. They worked with more than 85,000 image data and emphasized the pre-processing steps, applying five of them before CNN training. Their proposed model consisted of six convolutional layers and two dense layers, and the validation accuracy obtained was 98.57%. Another notable work was done by authors in [10], who tested their model's accuracy on data from five different languages. Raw data were first normalized and then five CNN layers were used, similar to the pre-trained network LeNet-5. They found experimental accuracies of 98.38% for Bangla digits, 97.2% for Oriya digits, 98.54% for Devanagari digits, 96.5% for Telugu digits and 99.1% for English digits. Authors in [11] presented Korean Hangul recognition using a deep CNN. Their architecture was composed of three main layers (convolutional, max-pooling and classification) and they used two datasets, observing accuracies of 95.96% on the SERI95a dataset and 92.92% on the PE92 dataset. In this way, authors of different papers obtained different numerical classification accuracies by implementing various methods. In the case of English digit recognition, some false predictions appear to have lowered the accuracy.
The novelty of this work is to develop an efficient model for English handwritten digit classification by improving the classification accuracy and minimizing the computational time of the computer vision recognition system. Five shallow CNN architectures are proposed to obtain the best model; the ideal model is simple yet fast to compute. Traditional classification methods spend time on pre-processing [12] and image segmentation before classification, while the pre-trained networks of transfer learning contain a sizable number of layers and thus take more time to compute [13]. There is a trade-off between computational time and accuracy [14], but simple CNN models are considered capable of classifying more accurately in less time.
The remaining parts of this paper are organized as follows. Section II presents a concise discussion of related work and a background study of CNNs. Section III describes the proposed methodology. Section IV presents the obtained results. Section V presents the analysis and comparison. Finally, Section VI concludes the overall work.

II. LITERATURE REVIEW
In the present research area, the Deep Neural Network (DNN) is a well-recognized approach for image recognition, object classification, speech recognition and other tasks [8], as it can provide higher classification accuracy. Authors in [15] proposed a CNN-based model for handwritten digit recognition on the MNIST dataset; their model consisted of eight layers, and they claimed this architecture provides an improved accuracy of 98.85% within 8,569 seconds. Another CNN model for the MNIST dataset was presented in the same manner by authors in [16], consisting of seven layers: one input layer, one output layer and five hidden layers in between. The best accuracy was found by varying the number of hidden layers and the number of epochs, and the best obtained accuracy was 99.21% within 15 epochs.
A minimal CNN-based model for handwritten digit recognition was presented by authors in [17]. They maintained that their minimal model reduced the mathematical computation for recognition on the MNIST dataset. Their model is similar to LeNet-5 and obtained 99.5% accuracy. A simplified CNN model was proposed for the MNIST database by authors in [18], who compared it with LeNet-1 and LeNet-5 and showed that their model reduced the error rate to 0.7%, lower than that of the other networks. Three different approaches, DNN, DBN and CNN, were proposed by authors in [1] for digit recognition; pre-processing, segmentation and feature extraction were performed before neural network training, and the best accuracy of 98.08% was found with the DNN approach. A seven-layer CNN model was proposed by authors in [19] for the MNIST dataset, yielding 95.7% test accuracy within 500 epochs. A combined architecture of CNN and Deeplearning4j (DL4J) was presented for MNIST dataset recognition by authors in [4]; they avoided any pre-processing steps and obtained 99.21% accuracy. Another recognition system, based on a Multilayer Perceptron (MLP), was proposed for the MNIST dataset by authors in [20]. They used back-propagation for training and feed-forward networks for validation, evaluating the best result by varying the number of iterations; with 5,000 data they obtained 99.32% accuracy within 250 iterations. For automatic handwritten digit detection on the MNIST dataset, authors in [21] proposed a four-layer CNN model; this simplified model was trained for different numbers of epochs and finally 98% accuracy was obtained. Another effective classifier is the Support Vector Machine (SVM), which has also been used for MNIST dataset recognition [22] [23]. A combination of a CNN and an SVM was proposed [20] for MNIST dataset recognition.
Features from the digit images were extracted using the CNN, and the SVM was utilized for classification at the output layer. Their proposed method applied a pre-processing step before the CNN and achieved an accuracy of 99.28%.
Past research shows that deep learning is an excellent performer for image classification, including handwriting classification [9]. CNN is a deep learning technique whose basic structure consists of one input layer, multiple hidden layers and one output layer. Image classification using a CNN includes two main stages: extraction of image features and classification [17]. The CNN building blocks accommodate several important layers, which are described as follows.

A. Input Layer
This is the first layer of the CNN. The input layer passes the input images to the next blocks of the CNN [17]. It represents the image data as a three-dimensional matrix and defines the input layer neurons from this information.

B. Convolution Layer
In this layer, filters are applied to extract the features from images [24] [25]. Filters of various sizes and numbers can be applied as necessary; these filters are also known as convolution kernels. The convolution operation is performed between the input image and the filters [17] [25], creating a feature map. The mathematical operation [16] is given in (1).

(F ⊗ K)(i, j) = Σ_m Σ_n F(m, n) K(i − m, j − n)   (1)

where ⊗ is the convolution operation, F(x, y) is the input image matrix, and K(x, y) is the filter or kernel function.

Fig. 1 Visualization of the convolution layer [26]

Fig. 1 demonstrates how the filter or kernel moves across the input image and performs the convolution operation in this layer. Each position produces a single value, and the process is continued over the full image. The result is passed to the next layer.
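As an illustration of (1), the following is a minimal NumPy sketch of the sliding-window convolution described above (not the authors' MATLAB implementation; a toy 4×4 image and 2×2 kernel are assumed):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution of a single-channel image with a kernel, as in (1):
    the kernel slides over the image and each position yields one value
    of the feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    # Flip the kernel for true convolution (cross-correlation would omit this).
    k = kernel[::-1, ::-1]
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy edge-like filter
fmap = conv2d(image, kernel)
print(fmap.shape)  # (3, 3): a 4x4 image and 2x2 kernel give a 3x3 feature map
```

Each convolutional layer in the proposed models applies many such kernels in parallel, producing one feature map per filter.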

C. Batch Normalization Layer
A batch normalization layer is used to adjust and scale the activations of the previous layer. This layer speeds up the training process and helps the network gain stability. Like the dropout layer, it is typically applied before the activation layer.
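The core of batch normalization can be sketched as follows: normalize each feature over the batch to zero mean and unit variance, then apply a learnable scale (gamma) and shift (beta). This is an illustrative NumPy sketch of the inference-free math, not the authors' MATLAB layer:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations per feature to zero mean and unit
    variance, then scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero
    return gamma * x_hat + beta

batch = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])  # 3 samples, 2 features
normed = batch_norm(batch)
print(normed.mean(axis=0))  # approximately [0, 0]
```

In practice gamma and beta are trained along with the network weights, which is what lets the layer both stabilize and speed up training.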

D. ReLU Activation Layer
Different types of activation functions are employed in deep learning. The sigmoid activation function tends to lose image information, which is why most networks use the ReLU layer instead. ReLU, or Rectified Linear Unit, is a non-linear activation function that is simple to use and fast to compute. The mathematical expression [19] is given in (2).

f(x) = max(0, x)   (2)

Equation (2) represents the ReLU function, which is shown graphically in Fig. 2.
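Equation (2) is a one-liner in code; the following NumPy sketch shows it applied element-wise, as a convolutional layer's output would be:

```python
import numpy as np

def relu(x):
    """ReLU from (2): f(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))  # [0.  0.  0.  1.5 3. ]
```

Negative activations are zeroed while positive ones pass through unchanged, which keeps the gradient simple (0 or 1) and cheap to compute.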
Fig. 2 Graphical representation of the ReLU function expressed in (2) [27]

E. Pooling Layer
The pooling layer shrinks the data volume of the previous network layer [15] and accelerates the network computation. It is employed between two convolutional layers to reduce the dimensionality [28]. Generally, two types of pooling are added to the network: max-pooling and average pooling [15] [28]. The max-pooling layer takes the highest value from the convolved output and diminishes its size. This process is also known as down-sampling [29] and is shown in Fig. 3, where the filter size and the stride are both kept at 2×2 and the highest value is taken from each sub-region [29].

Fig. 3 Visualization of the max-pooling layer [28]

F. Softmax Layer or Activation Function
Softmax is a logistic approach that can be used for a multi-class classifier [17]. This function is mainly used in the final layer for classification and provides the probabilistic output of the final layer. The mathematical expression [30] is given in (3).

σ(z)_i = exp(z_i) / Σ_{j=1}^{K} exp(z_j),  i = 1, …, K   (3)

where z is the input vector, z_i are the elements of the input vector, exp is the standard exponential function, and K is the number of classes.
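The two operations above, 2×2 max-pooling (down-sampling) and the softmax of (3), can be sketched in NumPy as follows (an illustrative sketch on toy data, not the authors' MATLAB layers):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """2x2 max-pooling: keep the largest value of each sub-region,
    halving the spatial dimensions (down-sampling)."""
    h, w = fmap.shape
    out = np.zeros((h // stride, w // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out

def softmax(z):
    """Softmax from (3): exponentiate and normalize so outputs sum to 1."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

fmap = np.array([[1., 3., 2., 0.],
                 [4., 6., 5., 1.],
                 [7., 2., 9., 8.],
                 [0., 5., 4., 3.]])
print(max_pool(fmap))  # [[6. 5.] [7. 9.]]

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.sum())     # 1.0: a probability distribution over the classes
```

In a ten-class digit classifier, K = 10 and the softmax output gives the predicted probability of each digit; the argmax is the predicted class.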

G. Fully Connected Layers
This layer has the arrangement of a normal neural network or feed-forward network [28]. Each node is directly connected to the previous and next layers [29] [31]. A fully connected layer learns the features from its previous layer. One or more fully connected layers can exist; the last FC layer is commonly known as the output layer, which predicts the desired classes. This part constitutes the classification stage of the CNN, which works after the feature extraction.
The proposed CNN architecture differs from earlier approaches to handwritten English digit recognition with different datasets. The proposed CNN model is simpler and less time-consuming than others. To demonstrate its effectiveness, two datasets were used. The improved accuracy shows that the automated CNN classifier of the proposed method is an effective approach.

III. METHODS
The proposed model uses a CNN to classify handwritten digits. Five architectures are presented to explore the best performance. In this section, the data specification, architecture details, and model parameters are described in more detail.
A. Data Details
For this handwritten classification experiment, the MNIST dataset [32] (Dataset-1) and a handwritten dataset from Kaggle [33] (Dataset-2) were used. The datasets consist of the 10 English digits (0-9) written by different writers. From these datasets, 12,000 image data of MNIST and 29,400 image data of Dataset-2 were used for the experiment. The digit images of the MNIST dataset are 28 × 28 pixels in size and those of Dataset-2 are 60 × 60 pixels. A portion of the experimental datasets is shown in Fig. 4. Table 1 shows how the training and validation data are split for the CNN architectures. Of the 12,000 MNIST data, 9,000 were used for training the network and 3,000 for validation. Likewise, of the 29,400 data from Dataset-2, 23,520 were used for training and 5,880 for validation.

B. Proposed Five CNN Architectures
Five different CNN architectures, shown in Fig. 5, are proposed for handwritten digit classification to explore the best accuracy. These proposed models are simpler and less time-consuming than pre-trained deep networks. A pre-trained network architecture refers to the saved model of a network that has already been trained with an immense amount of image data; it requires fine-tuning before being used. Pre-trained networks are built-in models, so users do not need to build network models for their own problem. GoogleNet, AlexNet, VGG16, LeNet-5 and ResNet-50 are the most popular pre-trained networks. First, MNIST data were input to the networks for the experiment; then the experiment was repeated for the Kaggle digit dataset. SGD is a superior optimizer, which shows faster performance. The network parameters are kept the same for all CNN architectures, as presented in Table 2. Fig. 5(a) is architecture 1, where the network consists of three consecutive convolutional layers, one pooling layer and finally the output layer. Fig. 5(b) has the same structure as Fig. 5(a), the main difference being the size and number of filters in each layer. Fig. 5(c) consists of two convolutional layers, two pooling layers and two fully connected layers, where the final FC layer is the output layer. Fig. 5(d) consists of three convolutional layers, two pooling layers and three fully connected layers. The last CNN architecture, Fig. 5(e), consists of four convolutional layers, two pooling layers and two fully connected layers. The 1st convolutional layer used 32 filters, the 2nd used 64 filters, the 3rd employed 84 filters and the last used 124 filters. Different numbers of FC layers and max-pooling layers were used in the models to observe their effect.

C. Algorithm
The general algorithm for the proposed classification process is described step by step below:
Step 1: Epoch number = 10, total number of classes = 10
Step 2: Input image dimension 28×28 (for MNIST dataset)
Step 3: Load the input digit images from the MNIST dataset
Step 4: Data splitting: training data = (9000, 28, 28, 1) and validation data = (3000, 28, 28, 1) (for MNIST dataset)
Step 5: Creation of the CNN model
Step 6: Network training
Step 7: Observation of validation accuracy and loss
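The data-splitting step (Step 4) can be sketched in Python with the array shapes stated above. The zero-filled arrays here are hypothetical stand-ins for the actual MNIST images, which the paper loaded in MATLAB:

```python
import numpy as np

# Hypothetical stand-in for the MNIST subset used in the paper:
# 12,000 grayscale 28x28 images with labels 0-9 (Steps 1-3).
num_images, num_classes, epochs = 12000, 10, 10
images = np.zeros((num_images, 28, 28, 1), dtype=np.float32)
labels = np.zeros(num_images, dtype=np.int64)

# Step 4: split into 9,000 training and 3,000 validation samples.
x_train, x_val = images[:9000], images[9000:]
y_train, y_val = labels[:9000], labels[9000:]
print(x_train.shape, x_val.shape)  # (9000, 28, 28, 1) (3000, 28, 28, 1)
```

The trailing dimension of 1 is the single grayscale channel, matching the (9000, 28, 28, 1) and (3000, 28, 28, 1) tensors of Step 4.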

IV. RESULTS
MATLAB 2018a running on a Core i5-7200U laptop was used for this experiment. The proposed five architectures were run in MATLAB for both datasets separately with the respective parameters (Table 2). After the CNN architectures were successfully implemented, the simulation provided the validation graphs; training took some time. The validation and loss graphs are provided in Fig. 6, where all the parameters are kept the same for each architecture. The accuracies were then observed for the different datasets. Fig. 6 presents the training progress, the validation accuracy, and the loss of the proposed five CNN models on the MNIST dataset: Fig. 6(a), Fig. 6(b), Fig. 6(c), Fig. 6(d) and Fig. 6(e) correspond to architectures 1, 2, 3, 4 and 5, respectively. The blue line in each graph indicates the training progression, the black dotted line the validation accuracy, and the red line the loss over the course of training on the MNIST dataset. Fig. 6(e) demonstrates less data loss than the other architecture graphs and thus provides the best classification performance.
The validation accuracy and loss of the proposed CNN models for handwritten digit recognition are recorded in Table 3, which is presented to help explore the best result and to identify the best model among all those proposed. Deep network models can require long computation times [15], and the traditional recognition processes are also time-consuming because of their several processing stages [8] [37]. Fig. 7 shows the validation performance of the proposed five CNN architectures; among them, architecture 5 achieved the highest accuracy. The MNIST dataset also yielded a higher accuracy than Dataset-2. Fig. 8 presents the error rates of the proposed CNN models for both datasets, showing that the error is reduced gradually, down to 0.47% in the case of the MNIST dataset (Dataset-1). Architecture 5 has the lowest error rate, which means it can classify the handwritten digits most accurately.

Fig. 8 Graphical representation of the error rates of the proposed CNN models

Fig. 9 is the confusion matrix of architecture 5 for the MNIST dataset, which shows that this model achieves the highest accuracy of 99.5% (i.e., 99.53% rounded) for the ten-class classification; the diagonal values represent the individual class accuracies. CNN model performance depends strongly on the network optimizer. Reduced data loss, higher accuracy and shorter convergence time are the fundamental reasons for using optimizers. Various optimizers exist. Stochastic Gradient Descent (SGD) is the most used optimizer in deep learning because it is fast, and computational redundancy can be avoided with the SGD optimizer [28]. Adding the 'momentum' parameter to the SGD algorithm accelerates the optimization [28]. Adaptive moment estimation, known as the 'Adam' optimizer [28], is an improved version of the stochastic gradient algorithm [28]. The Adam optimizer is efficient and less memory-consuming [34], and it combines features of both AdaGrad and RMSProp [34].
Root mean square propagation, or RMSProp, maintains a per-weight moving average of the squared gradients [35] and can improve network performance through this calibration. The proposed best CNN model was achieved with the SGD optimizer. For experimental purposes, the two other optimizers were applied to the best CNN model (architecture 5) with a 0.01 learning rate to explore the performance, as shown in Table 4. That model provides 98.33% accuracy when the network uses the Adam optimizer and 98.03% accuracy with the RMSProp optimizer. This observation confirms that the SGD optimizer provides the best validation accuracy for this network model. Table 5 shows the effect of various epoch counts on the same architecture-5 model. The epoch count indicates how many times the algorithm passes over the training dataset, with the weights updated each epoch. Theoretically, increasing the number of epochs can improve the accuracy, but too many epochs can cause the network to overfit. Table 5 also shows that the accuracy increased as the epoch number was increased; however, beyond a certain threshold the accuracy did not improve much over the previous value. So, an optimum number of epochs must be set. The results of the proposed five CNN models show that architecture 5 is the best model for handwritten digit classification, achieving the best result of 99.53% validation accuracy on the MNIST dataset. This best model, architecture 5, consists of four convolutional layers, two max-pooling layers and two fully connected layers, where the last fully connected layer is the output layer for the ten-class classification.
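The SGD-with-momentum update used by the proposed models can be sketched in a few lines: a velocity term accumulates past gradients, accelerating descent along consistent directions. This is an illustrative NumPy sketch on a toy quadratic (the learning rate of 0.01 matches the paper; the momentum value of 0.9 is an assumed common default, not stated in the source):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9):
    """One update of SGD with momentum: v accumulates a decaying sum of past
    gradients, and the weight moves along the velocity."""
    v = momentum * v - lr * grad
    return w + v, v

# Minimize the toy loss f(w) = w^2 (gradient 2w), starting from w = 5.
w, v = np.array(5.0), np.array(0.0)
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=2 * w)
print(float(w))  # converges toward the minimum at 0
```

Plain SGD would use `w - lr * grad` directly; the momentum term is what provides the optimization acceleration mentioned above.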
A 3×3 filter size was used in all convolutional layers of the architecture-5 CNN model. The 28×28 MNIST images were input to this model. Thirty-two 3×3 kernels were applied to obtain the first feature maps, which were then convolved again with kernels of the same size, this time with 64 filters.
The second convolutional layer's feature maps were then sub-sampled. After that, the values were passed through 84 filters of size 3×3, and these feature maps were convolved again with 124 filters of size 3×3. Finally, the values were sub-sampled once more. The feature maps were then passed into a fully connected layer with 50 neurons, and the final fully connected layer used 10 neurons for the ten-class classification. This model was trained and validated with 12,000 image data, and the obtained accuracy was 99.53% within 10 epochs. The FC layers were used for labelling, to classify further: the feature values were taken and passed on to make the final prediction. The max-pooling layers were used to reduce the feature information obtained from the preceding two convolutional layers, which helped to reduce the network size and provided fast training.
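The layer-by-layer description above can be checked with a short spatial-size walkthrough. The paper gives the filter counts and 3×3 kernel size but not the stride or padding, so stride 1, no padding, and 2×2 pooling are assumptions here:

```python
def conv_out(size, kernel=3, stride=1, pad=0):
    """Spatial output size of a convolution: (size - kernel + 2*pad)//stride + 1."""
    return (size - kernel + 2 * pad) // stride + 1

def pool_out(size, window=2, stride=2):
    """Spatial output size of max-pooling."""
    return (size - window) // stride + 1

s = 28                  # MNIST input, 28x28
s = conv_out(s)         # conv1: 32 filters of 3x3  -> 26x26
s = conv_out(s)         # conv2: 64 filters of 3x3  -> 24x24
s = pool_out(s)         # max-pool 2x2              -> 12x12
s = conv_out(s)         # conv3: 84 filters of 3x3  -> 10x10
s = conv_out(s)         # conv4: 124 filters of 3x3 -> 8x8
s = pool_out(s)         # max-pool 2x2              -> 4x4
print(s)  # spatial size entering the 50-neuron FC layer
```

Under these assumptions the flattened input to the first FC layer would be 4 × 4 × 124 values, which shows how the two pooling layers keep the FC stage small.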
Architecture 1 took more time than the others because it has a max-pooling layer only after three consecutive convolutional layers, whereas architecture 5 has a max-pooling layer after every two consecutive convolutional layers. The information is therefore reduced earlier, and less time is taken. In this way, the FC layers and max-pooling layers have an impact on the best model. The obtained best model achieves high performance without incurring any pre-processing cost, a result much better than the existing classifiers. Table 6 lists existing methods for classifying handwritten digits and demonstrates that the proposed best CNN model achieves 99.53% accuracy, an improvement over previous works. There are several reasons for this. First, the proposed method does not use any pre-processing steps, so such steps cannot affect the result. Second, the network was modelled specifically for the MNIST dataset. Also, the CNN is one of the best feature extractors and classifiers among machine learning algorithms [46]. There is a hypothesis that additional layers can provide better accuracy, but this must be correlated with the image size: MNIST is a well-recognized dataset consisting of 28 × 28 pixel images, so higher accuracy cannot be obtained simply by increasing the number of CNN layers [4]. For that reason, five shallow CNN models were proposed, and the best model (architecture 5), which consists of 4 convolutional layers, was obtained.
Architecture 5 uses more convolutional layers than the other architectures, arranged properly to extract the features from an image effectively. Much smaller networks are unable to extract features as well and so yield lower accuracy. Another reason for the improved accuracy is setting the proper optimizer for the network (as shown in Table 4); a proper learning rate also affects the accuracy. All things considered, the proposed simplified CNN model is effective because it can classify the handwritten digits easily and accurately.
Nevertheless, it should be noted that the proposed best CNN model was examined in detail only on the MNIST dataset for this experiment, which limits the results. Other types of datasets were not involved in the current study for observing the CNN model's behaviour. Future research will benefit from working with data augmentation and other datasets to explore the CNN model's performance in greater depth.

VI. CONCLUSIONS
The main concern of this work is to improve the English handwritten digit classification accuracy by using deep learning. Handwriting classification has various practical applications. The current research used 12,000 standard