Off-Line Handwritten Character Recognition System Using Support Vector Machine

The selection of classifiers and feature extraction methods plays a prime role in achieving the best possible classification accuracy in a character recognition system. This work addresses the issues that arise from the choice of classifier and feature extraction method. An efficient Support Vector Machine (SVM) based off-line handwritten character recognition system has been developed. The experiments were performed on the well-known standard database acquired from CEDAR, and seven feature extraction techniques are combined to construct the final feature vector. The experimental results show that the SVM outperforms other state-of-the-art techniques reported in the literature.


Introduction
Pattern recognition has been hailed as one of the most fascinating and challenging branches in the fields of artificial intelligence and optical character recognition. It plays a very important role in the automation of postal services, bank processing, document reading and similar tasks. The conversion of handwritten text into a notational representation is called handwriting recognition. It is a special problem in the field of pattern recognition and machine intelligence.
Character recognition operates in two modes, online and off-line, depending on the type of data available. Online character recognition identifies characters while they are being written and relies on special hardware (e.g. a smart pen or a pressure-sensitive tablet) capable of measuring the pen's pressure and velocity. Off-line character recognition, on the other hand, uses scanned digital images of characters written on paper with a pen. Systems for recognizing machine-printed text originated in the 1950s and have been in widespread use on desktop computers since the early 1990s. In the early 1990s, image processing and pattern recognition were also combined very effectively and efficiently with artificial intelligence and statistical techniques such as the Hidden Markov Model (HMM) [1] [2].
The paper is organized as follows. Section 2 presents an introduction to SVM. Section 3 describes the state of the art of SVM in off-line handwritten character recognition. Section 4 provides the details of the database and the preprocessing steps. Section 5 presents the methodology adopted for feature extraction. Section 6 presents the proposed SVM based handwritten character recognition system. Section 7 presents a comparative analysis between SVM and other state-of-the-art techniques. The conclusions of the paper are presented in section 8.
This paper is an extension of the authors' previous work in the field of off-line handwritten character recognition [3] [4] [5].
The Support Vector Machine (SVM) is considered to be a state-of-the-art tool for linear and nonlinear classification. It belongs to the class of supervised learning algorithms.
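An SVM classifies data by finding the best hyperplane that separates the data points of one class from those of the other class. The best hyperplane is the one that creates the maximum margin between the two classes, where the margin is the maximum width of the slab parallel to the hyperplane that has no interior data points. The points closest to the hyperplane are known as support vectors; they lie on the boundary of the slab, as shown in figure 1. SVM classifiers were initially proposed for binary classification. Multiclass SVMs are generally implemented by combining two-class SVMs using either the one-versus-all or the one-versus-one approach.
Suppose the given set of training points is:
$$(x_i, y_i), \quad i = 1, \dots, n, \quad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, +1\}$$
If the data points are linearly separable, the quadratic programming problem can be expressed in its standard form as:
$$\min_{w,\, b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, n$$
And if the data are linearly inseparable:
$$\min_{w,\, b,\, \zeta} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0, \quad i = 1, \dots, n$$
where $w$ is the normal vector of the hyperplane, $C$ is the soft margin parameter, $\zeta_i$ are the slack variables and $b$ is the bias.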
When the feature space is linearly inseparable, the data points are mapped into a higher-dimensional space by a function $\varphi$. The kernel function is the inner product of two mapped points and can be expressed as:
$$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$$
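The relation between the mapping and the kernel can be checked numerically. The following is a minimal sketch, assuming a homogeneous degree-2 polynomial kernel on 2-D inputs; the helper names `phi`, `dot` and `poly2_kernel` and the sample points are illustrative and are not taken from the proposed system.

```python
import math

def phi(x):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated in the input space without forming phi."""
    return dot(x, z) ** 2

# Illustrative points (assumption, not data from the paper).
x, z = (1.0, 2.0), (3.0, -1.0)
k_direct = poly2_kernel(x, z)      # kernel evaluated in the input space
k_mapped = dot(phi(x), phi(z))     # inner product in the mapped space
```

Both values agree up to floating-point rounding, which is why an SVM can work in the higher-dimensional space without computing the mapping explicitly.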

SVM for off-Line Handwritten Character Recognition
Machine learning is a subfield of artificial intelligence concerned with the development of techniques and methods that help a computer to learn. The Support Vector Machine (SVM) was introduced in 1992 by Boser, Guyon, and Vapnik at COLT-92. SVMs are a set of related supervised learning methods used for classification and regression, and they are among the state-of-the-art classification techniques for off-line handwritten characters. E. Frias-Martinez et al. [8] presented an efficient SVM based off-line signature recognition system and compared its performance with a traditional MLP classifier. For unconstrained Malayalam character recognition, Jomy John et al. [9] proposed a two-stage scheme: the first stage is feature extraction based on the Haar wavelet transform, and the second stage uses an SVM for classification. Cheng-Lin Liu [10] reported handwritten digit recognition on well-known data sets (CENPARMI, CEDAR and MNIST) using state-of-the-art feature extraction and classification techniques. Eight classifiers are combined with ten feature vectors. The classifiers include a kNN classifier, three neural network (NN) classifiers, a learning vector quantization (LVQ) classifier, a discriminative learning quadratic discriminant function (DLQDF) classifier and two support vector classifiers. The support vector classifier with the RBF kernel gives the highest accuracy. Nirban Das [11] proposed the combined application of a genetic algorithm (GA) and SVM for the optimal selection of local regions of Bangla character images for feature extraction. The authors of [12] presented a cursive character recognizer based on a segmentation and recognition approach. Character classification is achieved by SVM and neural gas (NG), where NG is used to obtain a suitable representation of the classes. The recognition rate of the SVM classifier is found to be the highest among the literature considered for cursive character recognition.
Jiang-Yiog Dong et al. [13] presented a technique to improve the nonlinear normalization scheme for Chinese character recognition that was earlier proposed by Yamada et al. (1990) [14]. In addition, they proposed an SVM classifier for solving large classification problems on a large data set with thousands of classes. The SVM was found to be very efficient, with very high performance on the ETL9B Chinese character database.
A robust and efficient object recognition system has been proposed using the Gabor wavelet and SVM [15]. For the recognition of handwritten digits from the well-known MNIST digit database, the authors of [16] proposed a hybrid model of a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM), in which the CNN is used as a feature extractor and the SVM as a classifier.
SVM is one of the classifiers used by the authors of [17] for the recognition of Indian and Arabic handwritten numerals. A hybrid MLP-SVM model has also been proposed by Washington W. Azevedo and Cleber Zanchettin [18] to recognize cursive handwritten characters; the ability of MLPs to recognize similar characters is improved by specialized local SVMs. A hybrid kNN-SVM method has been used for cursive character recognition [15], where SVMs are introduced to improve the performance of kNN on handwritten characters. A recognition system for handwritten uppercase and lowercase English characters has been proposed by Dewi Nasien et al. [20]. The Freeman chain code (FCC) is applied for feature extraction and SVM is used as the classifier. The experiment was conducted on the NIST dataset. A comparative study of Devnagari handwritten character recognition using different features and classifiers is presented by U. Pal et al. [21]. The classifiers considered are projection distance (PD), subspace method (SM), linear discriminant function (LDF), SVM, modified quadratic discriminant function (MQDF), mirror image learning (MIL), Euclidean distance (ED), nearest neighbor, kNN, modified projection distance (MPD), compound projection distance (CPD) and compound modified quadratic discriminant function (CMQDF). A genetic algorithm [22] has been implemented to select the feature subset as well as the parameters of the SVM classifier.

Contribution of the proposed work:
Seven feature extraction techniques are combined to form hybrid features, and an off-line handwritten character recognition system is developed using SVM as the classifier, which outperforms other techniques proposed in the literature.

Steps for Pre-processing and Database
The steps used in the proposed work are as follows: input, pre-processing, feature extraction and classification. These stages and their interconnection in the proposed handwritten character recognition system are shown in figure 2.

Preprocessing
The primary goal of pre-processing is to organize the information so that the character recognition process becomes simpler. The following pre-processing steps have been applied.

Binarization
To avoid the problems caused by noise and loss of information, the gray scale image of up to 256 gray levels is converted into a binary matrix. A global thresholding method is used for binarization: if the intensity of a pixel is greater than a particular threshold value, the pixel is set to white (represented by 1); otherwise it is set to black (represented by 0).
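The thresholding rule can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments; the threshold value of 128 and the function name `binarize` are assumptions, since the text does not state how the threshold is chosen.

```python
def binarize(gray, threshold=128):
    """Global thresholding: pixels brighter than `threshold` become 1 (white),
    all other pixels become 0 (black). The value 128 is an assumed threshold."""
    return [[1 if px > threshold else 0 for px in row] for row in gray]

# Tiny hypothetical gray scale image with intensities in 0..255.
gray = [
    [250, 240,  30, 245],
    [235,  20,  25, 250],
    [ 10, 128, 129, 255],
]
binary = binarize(gray)
```

A pixel exactly equal to the threshold is mapped to black, because the rule requires the intensity to be strictly greater than the threshold.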

Slant correction
The slant is defined as the slope of the general writing trend with respect to the vertical. The image matrix is divided into an upper and a lower half, and the centre of gravity of each half is computed. The slope of the line joining the two centres of gravity defines the slope of the window (image matrix) [23] [24].
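A minimal sketch of this slant estimate is given below. It assumes that ink pixels are black (0), following the binarization convention, and that the slant is reported as an angle from the vertical that is positive when the writing leans to the right; the function names and the sign convention are illustrative choices, not part of the cited method.

```python
import math

def centre_of_gravity(rows, row_offset=0, ink=0):
    """Mean (row, col) position of the ink pixels, or None if there are none."""
    pts = [(r + row_offset, c)
           for r, row in enumerate(rows)
           for c, px in enumerate(row) if px == ink]
    if not pts:
        return None
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def slant_angle(binary, ink=0):
    """Angle in degrees between the vertical and the line joining the centres
    of gravity of the upper and lower halves (positive = leaning right)."""
    half = len(binary) // 2
    upper = centre_of_gravity(binary[:half], 0, ink)
    lower = centre_of_gravity(binary[half:], half, ink)
    if upper is None or lower is None:
        return 0.0                      # assumed fallback when a half is empty
    d_row = lower[0] - upper[0]         # vertical separation, always positive
    d_col = upper[1] - lower[1]         # shift of the top relative to the bottom
    return math.degrees(math.atan2(d_col, d_row))

# Hypothetical 4 x 4 strokes: 0 = ink, 1 = background.
leaning = [[1, 1, 1, 0],
           [1, 1, 0, 1],
           [1, 0, 1, 1],
           [0, 1, 1, 1]]
upright = [[1, 0, 1, 1]] * 4
```

The estimated angle could then drive a shear transformation that straightens the character; that correction step is not shown here.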
Smoothing and noise removal
The exact surrounding region of a character is found with the help of smoothing. A Wiener filter has been used to perform smoothing. A median filter has then been applied to further enhance the image quality by removing any leftover noise.
Normalization
Normalization is required because the size of a character varies from one person to another, and also from time to time for the same person. Normalization equalizes the size of the character image (binary matrix) so that features can be extracted on the same footing. The character image is normalized to a window size of 42×32.
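Size normalization can be sketched as a resampling of the binary matrix onto a fixed grid. The example below assumes nearest-neighbour sampling and reads 42×32 as 42 rows by 32 columns; neither detail is stated in the text, and the function name `normalize_size` is illustrative.

```python
def normalize_size(binary, out_rows=42, out_cols=32):
    """Resample a binary matrix to out_rows x out_cols by nearest-neighbour
    sampling (an assumed interpolation scheme)."""
    in_rows, in_cols = len(binary), len(binary[0])
    return [
        [binary[r * in_rows // out_rows][c * in_cols // out_cols]
         for c in range(out_cols)]
        for r in range(out_rows)
    ]

# Hypothetical 2 x 2 input, enlarged to the 42 x 32 window.
small = [[0, 1],
         [1, 0]]
resized = normalize_size(small)
```

Nearest-neighbour sampling keeps the matrix strictly binary, so no second thresholding pass is needed after resizing.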
Input database
The benchmark dataset used by researchers for off-line handwritten character recognition is the CEDAR (Center of Excellence for Document Analysis and Recognition, USA) CDROM-1. The database has been acquired from CEDAR, Buffalo University [25], USA. The bi-tonal images of alphabetic and numeric characters in the database are divided into two groups: one set contains mixed alphabetic and numeric characters (BINANUMS) and the other set consists of numeric characters only (BINDIGIS). The training set has 24947 alphanumeric characters and the test set has 2890.

Feature Extractions
Feature extraction is an integral part of any recognition system. Several feature extraction techniques for representing a character have been reported in the literature [26]. Feature extraction represents handwritten text by a set of discriminative features, each based on a certain type of information extracted from the image. The character features that are vital for classification are extracted in this step. Seven sets of feature vectors have been extracted. The methods of feature extraction are as follows:
Box approach: The box approach is based on the spatial division of the character image. A 6 × 4 grid of horizontal and vertical lines is superimposed on the character image of size 42 × 32, producing 24 boxes, each of size 7 × 8. Some of the boxes contain a portion of the character and others remain empty; however, all boxes are considered for analysis [23] [24]. For each box, the normalized vector distance ($\gamma_b$) and the normalized angle ($\alpha_b$) are calculated, where N is the total number of pixels in each box.
Diagonal distance approach: Each character image of 42 × 32 pixels is divided equally into 24 zones of 7 × 8 pixels each. For each zone, features are extracted by moving along the diagonals of its 7 × 8 pixels. There are fourteen diagonals in each zone, so 14 diagonal features are obtained. These 14 diagonal features are averaged to form a single feature value for the corresponding zone. Finally, 24 features are extracted for each character image [27]. Figure 3 illustrates this approach.
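The division into 24 boxes and the averaging of the 14 diagonal features can be sketched as follows. The sketch assumes that a diagonal feature is the number of ink pixels along one diagonal and that ink is black (0); both are assumptions, and the function names are illustrative.

```python
INK = 0  # assumption: ink is black (0), as in the binarization step

def split_into_boxes(image, box_rows=7, box_cols=8):
    """Cut a 42 x 32 image into 6 x 4 = 24 boxes of 7 x 8, in row-major order."""
    boxes = []
    for r0 in range(0, len(image), box_rows):
        for c0 in range(0, len(image[0]), box_cols):
            boxes.append([row[c0:c0 + box_cols]
                          for row in image[r0:r0 + box_rows]])
    return boxes

def diagonal_feature(box, ink=INK):
    """Average, over the 7 + 8 - 1 = 14 diagonals, of the ink count per diagonal."""
    n_rows, n_cols = len(box), len(box[0])
    counts = [0] * (n_rows + n_cols - 1)
    for r in range(n_rows):
        for c in range(n_cols):
            if box[r][c] == ink:
                counts[r + c] += 1   # pixels with equal r + c lie on one diagonal
    return sum(counts) / len(counts)

# Hypothetical image: blank except for a fully inked top-left box.
image = [[1] * 32 for _ in range(42)]
for r in range(7):
    for c in range(8):
        image[r][c] = INK
boxes = split_into_boxes(image)
features = [diagonal_feature(b) for b in boxes]   # one value per zone
```

The same `split_into_boxes` partition can serve every box-based feature, which keeps the 24 zones aligned across the different feature sets.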

Gradient operations:
The image gradient is the variation of pixel values in the horizontal and vertical directions, and it may be used to extract information from images. The gradients have been calculated using the following formula:
$$\nabla f = \left( \frac{\Delta f}{\Delta x}, \ \frac{\Delta f}{\Delta y} \right)$$
where $\Delta f/\Delta x$ is the gradient in the x-direction and $\Delta f/\Delta y$ is the gradient in the y-direction. The gradient is measured for each box, and the process is applied sequentially to all 24 boxes. From the 24 boxes, 48 features are extracted.
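One way to obtain two gradient features per box is sketched below. The text does not say how the per-pixel gradients are aggregated within a box, so this sketch assumes the mean absolute forward difference in each direction; the function name `box_gradient` is hypothetical.

```python
def box_gradient(box):
    """Return (gx, gy): mean absolute forward difference along the columns
    (x-direction) and along the rows (y-direction). The aggregation by mean
    absolute difference is an assumption."""
    n_rows, n_cols = len(box), len(box[0])
    dx = [abs(box[r][c + 1] - box[r][c])
          for r in range(n_rows) for c in range(n_cols - 1)]
    dy = [abs(box[r + 1][c] - box[r][c])
          for r in range(n_rows - 1) for c in range(n_cols)]
    return (sum(dx) / len(dx), sum(dy) / len(dy))

# Hypothetical 7 x 8 box: left half 0, right half 1 (one vertical edge).
box = [[0] * 4 + [1] * 4 for _ in range(7)]
gx, gy = box_gradient(box)

# Two values per box over 24 boxes give the 48 gradient features.
boxes = [box] * 24
features = [v for b in boxes for v in box_gradient(b)]
```

For the vertical edge in this example only the x-direction responds, which matches the intuition that the gradient captures stroke orientation inside a box.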
Standard deviation: The standard deviation of the pixel values in each box has been calculated, where N is the number of rows and $\bar{x}$ is the mean. This process is applied to all 24 boxes to obtain 24 features.
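A per-box standard deviation can be sketched with the Python standard library. The sketch assumes the population form (division by the number of values) computed over all pixels of the box, because the exact formula is not recoverable from the text; `box_std` is an illustrative name.

```python
from statistics import pstdev

def box_std(box):
    """Population standard deviation of all pixel values in one box
    (assumed form; the sample form would divide by one less)."""
    return pstdev([px for row in box for px in row])

# Hypothetical 7 x 8 boxes: one half black and half white, one uniform.
half_box = [[0] * 4 + [1] * 4 for _ in range(7)]
flat_box = [[1] * 8 for _ in range(7)]

# One value per box over 24 boxes gives 24 features.
features = [box_std(b) for b in [half_box] * 12 + [flat_box] * 12]
```

A uniform box yields zero, so empty boxes contribute no spread to the feature vector.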

Centre of gravity:
The centre of gravity of the pixels in each box is obtained. This process yields a total of 48 features for the 24 boxes.
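The count of 48 features is consistent with two coordinates per box. A minimal sketch follows, assuming that the centre of gravity is taken over the ink pixels (black, 0) and that an empty box contributes (0.0, 0.0); both choices and the function name are assumptions.

```python
INK = 0  # assumption: ink is black (0), as in the binarization step

def box_centre_of_gravity(box, ink=INK):
    """Mean (row, col) of the ink pixels in a box; (0.0, 0.0) for an empty box
    (assumed convention)."""
    pts = [(r, c) for r, row in enumerate(box)
           for c, px in enumerate(row) if px == ink]
    if not pts:
        return (0.0, 0.0)
    n = len(pts)
    return (sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n)

# Hypothetical 7 x 8 box with two ink pixels, plus 23 empty boxes.
box = [[1] * 8 for _ in range(7)]
box[0][0] = INK
box[2][4] = INK
empty = [[1] * 8 for _ in range(7)]

cog = box_centre_of_gravity(box)
features = [v for b in [box] + [empty] * 23 for v in box_centre_of_gravity(b)]
```

Coordinates are relative to the top-left corner of each box, so the features of different boxes share the same range.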

Proposed SVM Based Handwritten Character Recognition System
The SVM library LIBSVM, running under the WEKA tool, is used in the proposed experiment. WEKA is a widely used tool in the machine learning field. LIBSVM is integrated software for support vector classification (C-SVC and nu-SVC) and regression (epsilon-SVR and nu-SVR). The different stages of SVM classification are shown in figure 4. The two-class (binary) SVM is applied to the multiclass character recognition problem using the one-versus-all method. C-SVC is used as the classifier with a polynomial function as the kernel type. All other parameters are left at their default values. The classifier works in two phases: training and testing. After pre-processing and feature extraction, the SVM is trained on the feature vectors of the training samples from the CEDAR dataset, which are stored in matrix form.
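The one-versus-all decision rule with a polynomial kernel can be sketched as follows. This is not the LIBSVM or WEKA code path; the coefficients, support vectors, biases and kernel parameters below are hand-set, hypothetical values standing in for trained class-versus-rest machines, and the kernel follows the general form (gamma * u.v + coef0)^degree.

```python
def poly_kernel(x, z, degree=2, gamma=1.0, coef0=1.0):
    """Polynomial kernel (gamma * x.z + coef0) ** degree; parameter values are
    illustrative, not the defaults used in the experiments."""
    return (gamma * sum(a * b for a, b in zip(x, z)) + coef0) ** degree

def decision_value(model, x):
    """f(x) = sum_i coef_i * K(sv_i, x) + bias for one class-versus-rest machine."""
    return sum(coef * poly_kernel(sv, x) for coef, sv in model["terms"]) + model["bias"]

def predict_one_vs_all(models, x):
    """Return the label whose machine gives the largest decision value."""
    return max(models, key=lambda label: decision_value(models[label], x))

# Hypothetical machines for three classes on 2-D feature vectors.
models = {
    "A": {"terms": [(1.0, (1.0, 0.0)), (-1.0, (0.0, 1.0))], "bias": 0.0},
    "B": {"terms": [(1.0, (0.0, 1.0)), (-1.0, (1.0, 0.0))], "bias": 0.0},
    "C": {"terms": [(-0.5, (1.0, 0.0)), (-0.5, (0.0, 1.0))], "bias": 3.0},
}
predictions = [predict_one_vs_all(models, x)
               for x in [(2.0, 0.0), (0.0, 2.0), (0.0, 0.0)]]
```

In a trained system each machine would be fitted with the samples of its own class as positives and all remaining samples as negatives; only the combination step is shown here.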

Conclusions
Support Vector Machines are one of the most effective methods used in the field of pattern recognition. The method described in this paper uses the preprocessing steps of binarization, slant correction, smoothing and noise removal, and normalization to make classification easier and more exact. Seven feature extraction approaches, namely the box method, the diagonal distance method, mean and gradient operations, standard deviation, centre of gravity and edge detection, have been used to develop an efficient off-line character recognition system using the Support Vector Machine. The results shown in Table 3 clearly illustrate the higher performance of the proposed method in recognizing handwritten characters within the dataset used. The proposed method outperforms most of the state-of-the-art methods examined in the paper for capital letters, with an accuracy of 95.74%. The method also performs well for lower case letters, with an accuracy of 92.19%, and for numeric digits, with an accuracy of 97.16%. The timing analysis also shows that the proposed method is fast and efficient while remaining accurate. The superior performance of the SVM is due to its superior generalisation ability in high-dimensional space.