Data Augmentation and Bayesian Methods for Multicategory Support Vector Machines

Abstract: The support vector machine (SVM) has become very popular in the machine learning literature and has recently received considerable attention from statisticians. The commonly used multicategory SVM is formulated within the frequentist framework. In this paper, we develop a multicategory support vector machine under the Bayesian framework. Numerical studies were performed using an EM algorithm and the Gibbs sampler. Our results show that the classification accuracy of the Bayesian approach is comparable to that of frequentist approaches, while the Bayesian approach has the additional advantage of providing estimates of the uncertainty in predictions.


Introduction
The support vector machine (SVM) has drawn increasing attention from the statistics community as its popularity has risen in the machine learning literature. The well-known two-category SVM for the binary classification problem has a geometric interpretation: it seeks the separating hyperplane that maximizes the margin between the two classes, that is, the hyperplane with the largest distance to the data of the two groups; see Boser, Guyon, and Vapnik [1], Vapnik [2], and Burges [3].
The Bayesian approach, which has developed rapidly during the past thirty years, plays a very important role in statistics. Polson and Scott [4] applied the Bayesian approach to the SVM classification problem by developing a latent variable representation of the original SVM, which makes it possible to use EM or MCMC algorithms for parameter estimation. In their method, the data augmentation scheme can be formulated in terms of complete-data sufficient statistics, a considerable advantage when working with large data sets, where most of the computational expense comes from repeatedly iterating over the data: methods based on complete-data sufficient statistics need only compute those statistics once per iteration, at which point the entire parameter vector can be updated. [4] considers the binary case with a $k$-dimensional predictor $x_i$, where the class labels $y_i$ are either 1 or $-1$, and the $\alpha$-norm regularized support vector classifier minimizes
$$d_\alpha(\beta) = \sum_{i=1}^{n} \max\left(1 - y_i x_i^{\top}\beta,\ 0\right) + \nu^{-\alpha} \sum_{j=1}^{k} \left|\beta_j / \sigma_j\right|^{\alpha},$$
where $\sigma_j$ is the standard deviation of the $j$th predictor variable, except that $\sigma_j = 1$ for the intercept term, and $\nu$ is a tuning parameter [11].
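To make the objective concrete, here is a minimal R sketch of this $\alpha$-norm regularized objective (illustrative, not from the paper); `X`, `y`, `beta`, `sigma`, `nu`, and `alpha` are assumed inputs, with labels in $\{-1, 1\}$ and `sigma[1] = 1` covering the intercept:

```r
# Hinge loss plus alpha-norm penalty for the binary support vector classifier
svm_objective <- function(beta, X, y, sigma, nu, alpha = 1) {
  hinge <- pmax(1 - y * as.vector(X %*% beta), 0)       # per-observation hinge loss
  penalty <- sum(abs(beta / sigma)^alpha) / nu^alpha    # alpha-norm penalty
  sum(hinge) + penalty
}
```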

Model and Notations
In this section, we extend this model to the multicategory case. Consider a $k$-category classification problem with a vector of predictors $x = (x_1, \dots, x_p)^{\top}$.
Denote $y_i$ as the class label of example $i$, and let $\Theta_r = \{i : y_i \neq r\}$. The pseudo-likelihood can then be rewritten as
$$L(y \mid \beta) \propto \prod_{r=1}^{k} \exp\!\Big(-2 \sum_{i \in \Theta_r} \max\big(x_i^{\top}\beta_r + 1,\ 0\big)\Big). \qquad (2)$$
Based on the objective function in (2), consider the exponential power prior distribution for $\beta_r$, which contains the regularization penalty, as follows:
$$p(\beta_r \mid \nu, \alpha) \propto \prod_{j=1}^{p} \exp\!\big(-\left|\beta_{rj} / (\sigma_j \nu)\right|^{\alpha}\big).$$
In general, consider $\alpha \in (0, 2]$, where $\alpha = 2$ corresponds to "ridge regression" and $\alpha = 1$ corresponds to the "lasso". The prior regularization penalty can then be expressed as a scale mixture of normals,
$$\exp\!\big(-\left|\beta_{rj}/(\sigma_j \nu)\right|^{\alpha}\big) \propto \int_0^{\infty} \frac{1}{\sqrt{2\pi \omega_{rj}}} \exp\!\Big(-\frac{\beta_{rj}^2}{2\nu^2\sigma_j^2\omega_{rj}}\Big)\, p(\omega_{rj} \mid \alpha)\, d\omega_{rj},$$
where $p(\omega_{rj} \mid \alpha)$ is exponential with mean 2 for the special case $\alpha = 1$.
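As a sanity check on the scale-mixture identity for $\alpha = 1$, the following R snippet (illustrative, not from the paper) simulates normals whose variances are exponential with mean 2 and compares the result with the Laplace kernel:

```r
set.seed(1)
omega <- rexp(1e6, rate = 1/2)                   # exponential mixing, mean 2
beta  <- rnorm(1e6, mean = 0, sd = sqrt(omega))  # N(0, omega) given omega
hist(beta, breaks = 200, freq = FALSE, xlim = c(-6, 6))
curve(0.5 * exp(-abs(x)), add = TRUE, col = "red")  # standard Laplace density
```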

Conditional Distribution
According to the above computations, especially equations (7) and (9), the support vector machine pseudo-posterior distribution can be expressed as the marginal of the complete-data pseudo-posterior distribution as follows:
$$p(\beta \mid \nu, y) = \int p(\beta, \lambda, \omega \mid \nu, y)\, d\lambda\, d\omega. \qquad (10)$$
Consider first the full conditional distribution of $\beta_r$ given $\lambda$, $\omega$, and $\nu$. According to equation (10), the full conditional distribution of $\beta_r$, for any $r = 1, \dots, k$, is
$$p(\beta_r \mid \lambda, \omega, \nu, y) \propto \exp\!\Big(-\sum_{i \in \Theta_r} \frac{\big(1 + \lambda_{ri} + x_i^{\top}\beta_r\big)^2}{2\lambda_{ri}} - \sum_{j=1}^{p} \frac{\beta_{rj}^2}{2\nu^2\sigma_j^2\omega_{rj}}\Big).$$
Define the matrices $\Lambda_r = \mathrm{diag}(\lambda_r)$ and $\Omega_r = \mathrm{diag}(\omega_r)$, whose diagonal elements are the elements of $\lambda_r$ and $\omega_r$, respectively, and let $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_p^2)$. Also let $X_r$ denote the matrix with row $i$ equal to $x_i^{\top}$, the predictor vector of the $i$th subject in $\Theta_r$. This model can be written in hierarchical form:
$$\beta_{rj} = \nu\, \sigma_j \sqrt{\omega_{rj}}\; e_{\beta, j}, \qquad 1 + \lambda_{ri} + x_i^{\top}\beta_r = \sqrt{\lambda_{ri}}\; e_{\lambda, i},$$
where $e_{\beta}$ and $e_{\lambda}$ are vectors of iid standard normal deviates with dimensions matching $\beta_r$ and $\lambda_r$. Thus $\beta_r$ has a conditional normal posterior distribution given by
$$\beta_r \mid \lambda, \omega, \nu, y \sim N(b_r, B_r), \qquad B_r^{-1} = X_r^{\top}\Lambda_r^{-1}X_r + \nu^{-2}\Sigma^{-1}\Omega_r^{-1}, \qquad b_r = -B_r X_r^{\top}\Lambda_r^{-1}(1 + \lambda_r).$$
Next consider the full conditional distributions of $\lambda_r$ and $\omega_r$ given $\beta$, $\nu$, and $y$. For the conditional distribution of $\lambda_{ri}$, $r = 1, \dots, k$ and $i \in \Theta_r$, the complete pseudo-posterior distribution gives
$$p(\lambda_{ri} \mid \beta_r, y) \propto \lambda_{ri}^{-1/2} \exp\!\Big(-\frac{\big(1 + \lambda_{ri} + x_i^{\top}\beta_r\big)^2}{2\lambda_{ri}}\Big).$$
This implies that
$$\lambda_{ri}^{-1} \mid \beta_r, y \sim \mathcal{IG}\Big(\left|1 + x_i^{\top}\beta_r\right|^{-1},\ 1\Big),$$
an inverse Gaussian distribution. The full conditional distribution of $\omega_{rj}$ is proportional to the integrand in equation (9). In general this is complicated because the mixing density $p(\omega_{rj} \mid \alpha)$ is generally not available in closed form. However, for the two special cases $\alpha = 1$ and $\alpha = 2$, closed-form solutions are available. When $\alpha = 2$, $p(\omega_{rj} \mid \beta_r)$ is a point mass at 1. For $\alpha = 1$, the full conditional distribution of $\omega_{rj}$ is given by
$$\omega_{rj}^{-1} \mid \beta_r, \nu \sim \mathcal{IG}\Big(\left|\nu \sigma_j / \beta_{rj}\right|,\ 1\Big).$$
Later we will use these distributions to develop learning algorithms.
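The conditionals above suggest a simple Gibbs sweep for class $r$. The R sketch below assumes the forms just derived (normal for $\beta_r$, inverse Gaussian for $\lambda_{ri}^{-1}$ and $\omega_{rj}^{-1}$ when $\alpha = 1$); `Xr` stacks the predictor vectors for $i \in \Theta_r$, and `rinvgauss` comes from the statmod package:

```r
library(statmod)  # rinvgauss()

gibbs_sweep_r <- function(beta_r, Xr, sigma, nu) {
  u <- as.vector(Xr %*% beta_r)
  # 1/lambda_i | beta_r ~ InvGauss(mean = 1/|1 + x_i' beta_r|, shape = 1)
  lambda <- 1 / rinvgauss(nrow(Xr), mean = 1 / abs(1 + u), shape = 1)
  # 1/omega_j | beta_r ~ InvGauss(mean = |nu * sigma_j / beta_rj|, shape = 1)
  omega <- 1 / rinvgauss(length(beta_r),
                         mean = abs(nu * sigma / beta_r), shape = 1)
  # beta_r | lambda, omega ~ N(b_r, B_r) with the precision given above
  B_inv <- crossprod(Xr / sqrt(lambda)) + diag(1 / (nu^2 * sigma^2 * omega))
  b <- solve(B_inv, -crossprod(Xr, (1 + lambda) / lambda))
  beta_new <- as.vector(b + solve(chol(B_inv), rnorm(length(beta_r))))
  list(beta_r = beta_new, lambda = lambda, omega = omega)
}
```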

Point Estimation by EM and Other Related Algorithms
In this section, we use the distributions obtained in Section 2 to construct EM-style algorithms to estimate the coefficients. First, an EM algorithm for learning $\beta$ with a fixed value of the tuning parameter $\nu$ is developed. Then we develop an ECME algorithm to learn $\beta$ and $\nu$ simultaneously.

Learning β with Fixed ν
With the augmented data $\lambda$ and $\omega$, the EM algorithm is an iterative method for finding posterior modes or MLEs. From equation (12), we know that the posterior distributions of the $\beta_r$'s, $r = 1, \dots, k$, are independent, so we can estimate them separately using the EM algorithm. For $\beta_r$, the E-step computes the expected complete-data log posterior $Q(\beta_r \mid \beta_r^{(t)})$, where the expectation is over $(\lambda_r, \omega_r)$ given $\beta_r^{(t)}$ and the observed data, and the M-step maximizes $Q(\beta_r \mid \beta_r^{(t)})$ over $\beta_r$.
As discussed above, the result for $\omega_r$ depends on the value of $\alpha$; we focus on the case $\alpha = 1$. According to equation (16), the E-step expectations are
$$\mathbb{E}\big[\lambda_{ri}^{-1} \mid \beta_r^{(t)}\big] = \left|1 + x_i^{\top}\beta_r^{(t)}\right|^{-1}, \qquad \mathbb{E}\big[\omega_{rj}^{-1} \mid \beta_r^{(t)}\big] = \left|\nu \sigma_j / \beta_{rj}^{(t)}\right|.$$
Recall that the conditional posterior of $\beta_r$ follows a multivariate normal distribution, so the posterior mode coincides with the posterior mean. By equations (13) and (14), we obtain the following algorithm.
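A compact R sketch of the resulting EM-SVM iteration for fixed $\nu$ follows (the function name, starting value, and stopping tolerance are illustrative; the update is the conditional-normal mean above with the latent terms replaced by their expectations):

```r
em_svm_r <- function(Xr, sigma, nu, max_iter = 200, tol = 1e-6) {
  beta_r <- rep(0.01, ncol(Xr))                 # small nonzero start
  for (t in seq_len(max_iter)) {
    u <- as.vector(Xr %*% beta_r)
    inv_lambda <- 1 / abs(1 + u)                # E-step: E[1/lambda_i | beta_r]
    inv_omega  <- abs(nu * sigma / beta_r)      # E-step: E[1/omega_j | beta_r]
    # M-step: posterior mode (= mean) of the conditional normal for beta_r
    B_inv <- crossprod(Xr * sqrt(inv_lambda)) +
             diag(inv_omega / (nu^2 * sigma^2))
    beta_new <- as.vector(solve(B_inv, -crossprod(Xr, inv_lambda + 1)))
    if (max(abs(beta_new - beta_r)) < tol) return(beta_new)
    beta_r <- beta_new
  }
  beta_r
}
```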

Learning β and ν Simultaneously
In order to learn $\beta$ and $\nu$ together, the expectation-conditional maximization either (ECME) algorithm is used, where the final "E" stands for "either": each conditional maximization step maximizes either the expected complete-data log posterior or the actual observed-data log posterior. To implement the ECME algorithm, we assume an inverse gamma prior distribution for $\nu$. Combining this prior with equation (8), we can find the conditional posterior density of $\nu$ given $\beta$ and $\omega$.
The following algorithm can be obtained with minor modification of the EM-SVM algorithm.
Algorithm: ECME-SVM
E-Step: Identical to the E-step of EM-SVM with $\nu = \nu^{(t)}$.
CM-Step: Identical to the M-step of EM-SVM with $\nu = \nu^{(t)}$.
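As an illustration of the additional conditional maximization step for $\nu$, the sketch below assumes an inverse gamma prior $\mathrm{IG}(a, b)$ on $\nu$ with $\alpha = 1$, under which the conditional posterior of $\nu$ given $\beta$ is again inverse gamma with a closed-form mode; the names `a` and `b` and the $k \times p$ count are assumptions of this sketch, not quantities from the paper:

```r
# CM-2 step: set nu to the mode of IG(a + k*p, b + sum_r sum_j |beta_rj / sigma_j|)
update_nu <- function(beta_list, sigma, a, b) {
  s  <- sum(vapply(beta_list, function(br) sum(abs(br / sigma)), numeric(1)))
  kp <- length(beta_list) * length(sigma)
  (b + s) / (a + kp + 1)  # mode of IG(shape, scale) is scale / (shape + 1)
}
```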

Fully Bayesian Multicategory Support Vector Machines
In the MSVM framework of [15], following [12,13], we can find $f(x) = (f_1(x), \dots, f_k(x))$ by minimizing the following penalized function when $\alpha = 1$:
$$d(\beta, \nu) = \sum_{r=1}^{k} \sum_{i \in \Theta_r} \max\big(f_r(x_i) + 1,\ 0\big) + \nu^{-1} \sum_{r=1}^{k} \sum_{j=1}^{p} \left|\beta_{rj} / \sigma_j\right|, \qquad (19)$$
where $\sigma_j$ is the standard deviation of the $j$th element of $x$, $\nu$ is a tuning parameter, and $\Theta_r = \{i : y_i \neq r\}$, with $y_i$ the class label of observation $i$.
From the Bayesian perspective, the minimization problem (19) can be viewed as finding the mode of a pseudo-posterior distribution. That is,
$$p(\beta \mid \nu, y) \propto \exp\big(-d(\beta, \nu)\big) \propto C(\nu)\, L(y \mid \beta)\, p(\beta \mid \nu),$$
where $C(\nu)$ is a normalization constant. According to the form of the objective function, we can adopt the likelihood function in (2) for the data and assume the exponential power prior for $\beta$ given above.
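The following R function (an illustrative sketch; `B` is a $p \times k$ matrix whose $r$th column is $\beta_r$, and `y` holds labels in $1, \dots, k$) evaluates the penalized MSVM objective (19) for $\alpha = 1$, which matches the pseudo-posterior above up to the normalizing constant:

```r
msvm_objective <- function(B, X, y, sigma, nu) {
  F <- X %*% B                              # F[i, r] = f_r(x_i) = x_i' beta_r
  hinge <- 0
  for (r in seq_len(ncol(B))) {
    idx <- which(y != r)                    # Theta_r = {i : y_i != r}
    hinge <- hinge + sum(pmax(F[idx, r] + 1, 0))
  }
  hinge + sum(abs(B / sigma)) / nu          # alpha = 1 penalty
}
```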

Data Introduction
TIMSS stands for the Trends in International Mathematics and Science Study, one of the largest and most important global studies of educational achievement. TIMSS was first conducted in 1995 and has collected data every four years on the achievement of fourth- and eighth-grade students internationally. TIMSS 2011 is the fifth assessment in the series, with 32 countries participating. There is enormous diversity among the TIMSS countries in terms of economic development, geographical location, and population size. Countries participating in TIMSS aim for a sample of at least 4,500 students to ensure that there are enough respondents for each item. The study is dedicated to improving the teaching and learning of mathematics and science by comparing countries and exploring the environmental factors that predict students' achievement.
The TIMSS 2011 International Database includes the data from instruments administered to the students, their parents, their teachers, and their school principals. These include the student responses to the TIMSS mathematics and science achievement items, and the student, teacher, and principal responses to the student, home, teacher, and school background questionnaires. This is a large dataset with six files: school background data files, student background data files, student achievement data files, home background data files, student-teacher linkage files, and teacher background data files.
We used the 4th-grade USA data in the dataset; 12,569 U.S. students participated in this study. Students were administered a background questionnaire with questions related to their home background, school experiences, and attitudes toward reading, mathematics, and science. An example question is "About how many books are there in your home? (Do not count magazines, newspapers, or your school books.)" The student background data files contain the students' responses to these questions and include 15 variables. Each variable may have multiple items; for example, mathematics self-concept has 7 items. We use the mean of a variable's items as its score. Excluding the questions designed specifically for learning science, 13 potential predictors from the student background data files were used in the model: two demographic variables (sex and age), six environmental variables about resources and parental support, and five psychological variables about students' feelings and experiences. Figure 1 shows the description of the predictors and their descriptive statistics. Since the data we used are from the U.S., a developed country, some of the measures may be skewed or subject to a ceiling effect. For example, ASBG05S ranges from 0 to 2 with a mean of 1.57; it is negatively skewed.
The response variable is the student's achievement level: students of highest achievement are in group 5, while students of lowest achievement are in group 1. Figures 2 and 3 show the frequency and percentage of each group.

Traditional Methods
Before applying the new method, three traditional methods were used to classify the current data.
First, we used linear discriminant analysis (LDA) with two types of prior: proportional and equal. With the proportional prior, the total error rate was 57.67%. The error count estimates were highest at the two ends: for groups 1 and 5 the error rates were both above 90%, while for groups 3 and 4 they were about 40%. With the equal prior, the total error rate was 62.16%, but the error rates were highest in the middle groups (all three middle groups had error rates above 75%); the error rates at the ends were smaller, 47.80% for group 1 and 34.62% for group 5.
Second, we applied quadratic discriminant analysis (QDA) to the data, again with proportional and equal priors. With the proportional prior, the resubstitution error rate was 56.23%. The error count estimate was highest for group 1 (95%), while for groups 3 and 4 the error rates were about 50%. With the equal prior, the total error rate was 58.76%, but the error rates were highest in the middle groups (all three around 70%); the error rates at the ends were smaller, 43.40% for group 1 and 29% for group 5. Compared with LDA, QDA appears to perform slightly better. Finally, we used a one-against-one (pairwise) SVM, via the ksvm function in the kernlab package, to analyze the data. The training error was 63.96%, and the 10-fold cross-validation error rate was 74.93%. These error rates are larger than those of LDA and QDA, indicating that the one-against-one (pairwise) SVM may not be appropriate for this dataset.
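For reference, the three baseline fits can be reproduced along the following lines in R (a sketch: `dat` is an assumed data frame with the factor label `group` and the 13 predictors):

```r
library(MASS)     # lda(), qda()
library(kernlab)  # ksvm()

fit_lda  <- lda(group ~ ., data = dat)                        # proportional prior (default)
fit_ldae <- lda(group ~ ., data = dat, prior = rep(1/5, 5))   # equal prior
fit_qda  <- qda(group ~ ., data = dat)
mean(predict(fit_lda)$class != dat$group)                     # resubstitution error rate

fit_svm <- ksvm(group ~ ., data = dat, type = "C-svc", cross = 10)  # one-against-one
error(fit_svm)   # training error
cross(fit_svm)   # 10-fold cross-validation error
```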

EM Algorithm
Two parts of the EM algorithm were studied. We first fix $\nu$ and study the relationship between the value of $\nu$ and the number of nonzero parameter estimates. When the conditional expectation of $\omega_j^{-1}$ is infinite, it follows that $\beta_j = 0$, in which case we can simply ignore the $j$th column of the covariate matrix. Figure 4 shows how the value of $\nu$ affects the number of nonzero parameters; the vertical values are means over ten runs. To study the point estimates of $\beta$, we select $\nu = 0.9$. Our data set includes 13 covariates, which is not many; in some datasets, however, the dimension of $X$ can be very large. We therefore consider a convergence criterion that we call likelihood convergence: we declare convergence when the log-likelihood changes little between iterations. Figure 5 shows the convergence of the log-likelihood, which increases very quickly within the first 20 iterations and slowly thereafter.
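The likelihood convergence rule amounts to a one-line check (illustrative; the tolerance `tol` is an assumed value, not one from the paper):

```r
# Stop when the pseudo log-likelihood changes by less than tol between iterations
ll_converged <- function(ll_new, ll_old, tol = 1e-4) abs(ll_new - ll_old) < tol
```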
In the estimation, $\beta$ collects the class-specific coefficient vectors $\beta_1, \dots, \beta_k$. We initialize $\lambda = 1$, and the estimates of the $\sigma_j$ are computed directly from the dataset. Log-likelihood convergence is one way to meet our goal; however, this criterion cannot guarantee the stability of the parameter estimates. Table 1 gives the estimation results and Figure 6 shows the variances of the parameter estimates, all computed from 10 repeated runs. The largest variances occur for certain coefficients in groups 1, 3, and 4; in group 4, two of the coefficient estimates have large variances compared with the other estimates, and in groups 4 and 5 the variances of two further coefficients are also large. (Figure 6. Variances of the parameter estimates.)

Bayesian MSVM
First, we tested our Bayesian MSVM on the vehicle silhouette dataset from the UCI data repository, which classifies a given silhouette as one of four types of vehicle based on 18 predictors. We selected 500 training samples and used the remaining 346 samples as the testing set. All 18 predictors were used as input variables and were normalized to have zero mean and unit variance. We ran our MCMC algorithm for 1000 iterations with a burn-in of 500. Figure 7 shows the sample paths of the coefficients for two predictors; neither shows a trend. In fact, the sample paths of all the parameters are stationary, which indicates that our MCMC has converged. The classification error on the training set is 39% (195/500), and the prediction error on the testing set is 36.99% (128/346).
Next, we applied our Bayesian model to the TIMSS data set, randomly selecting a training subset of 1000 students (with a further 500 used for testing). All 13 predictors were used as input variables and were normalized to have zero mean and unit variance. We ran our MCMC algorithm for 1000 iterations with a burn-in of 500. Figure 8 shows the sample paths of the coefficients for two predictors; neither shows a trend, and the sample paths of all the parameters are stationary, indicating that the MCMC has converged. The classification error on the training set is 22% (220/1000), and the prediction error on the testing set is 26.6% (133/500). This suggests that the method, including its built-in variable selection, works quite well.
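The convergence check used in both experiments amounts to inspecting post-burn-in sample paths; a sketch, assuming `draws` is the (iterations × parameters) matrix of MCMC output for one class:

```r
keep <- 501:1000                              # discard 500 burn-in iterations
plot(draws[keep, 1], type = "l", ylab = "coefficient 1")  # sample path, as in Figures 7-8
plot(draws[keep, 2], type = "l", ylab = "coefficient 2")
beta_hat <- colMeans(draws[keep, ])           # posterior-mean coefficient estimate
```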

Discussion
In the Bayesian MSVM framework, we need to use the truncated multivariate normal distribution to satisfy the sum-to-zero constraint on $f$, i.e., $\sum_{r=1}^{k} f_r(x_i) = 0$ for $i = 1, \dots, n$, but this sacrifices computational efficiency. In our case, suppose $f_r(x) = x^{\top}\beta_r$. The sum-to-zero constraint is then equivalent to $X \sum_{r=1}^{k} \beta_r = 0$. If the design matrix $X$ is of full rank, then $\sum_{r=1}^{k} \beta_r = 0_p$, or equivalently $D\mathbf{1} = 0_p$ with $D = (\beta_1, \dots, \beta_k)$, guarantees the constraint. Following [14], one possible solution is the reparameterization
$$D = BH, \qquad (32)$$
where $H = I_k - \tfrac{1}{k}\mathbf{1}\mathbf{1}^{\top}$. However, since the matrix $H$ is singular, it is impossible to obtain the density of the parameters $B$ from the distribution of $D$ using the density transformation formula. One possible way around this issue is to look for a kernel function corresponding to the $\alpha$-norm penalty in (18).
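A quick numerical illustration of the reparameterization issue (here $k = 5$ matches the five achievement groups; this snippet only demonstrates the algebra and is not part of the estimation code):

```r
k <- 5
H <- diag(k) - matrix(1 / k, k, k)  # centering matrix H = I - (1/k) 1 1'
rowSums(H)                          # all zero, so D = B H satisfies D 1 = 0
qr(H)$rank                          # k - 1 < k: H is singular, no density transform
```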

Conclusion
This paper considered the Bayesian multicategory support vector machine (MSVM). A Bayesian framework for the MSVM was developed based on data augmentation, with estimation carried out by the EM algorithm and the Gibbs sampler. The proposed Bayesian approach achieves error rates similar to the frequentist formulation of the SVM, while offering high modeling flexibility, automatic tuning of the regularization parameter, and estimates of the uncertainty in predictions. The numerical studies suggest that the proposed method has good prediction accuracy and works well in practical situations.