Support Vector Machine (SVM) was proposed by Cortes and Vapnik in 1995 [1]. It is a very popular and powerful classification algorithm whose main idea is to find an optimal separating hyperplane between two classes. Thanks to its strong classification performance, SVM has been applied to a wide variety of problems.
Over the past decade, scholars have proposed many classifiers on the basis of SVM. Among these extensions, I'd like to introduce Twin Support Vector Machine (TSVM) [2], since it has received considerable attention.
TSVM was proposed in 2007 with the idea of performing binary classification using two non-parallel hyperplanes [2]. Each hyperplane fits the samples of its own class and lies as far as possible from the samples of the other class. To grasp TSVM's idea, the figure below shows the geometrical representation of the TSVM classifier.
To explain the math behind this classification method, some notation is needed. Let \(T=\{(x_1, y_1), \dots , (x_n, y_n)\}\) be the full training set of \(n\) \(d\)-dimensional samples, where \(x_{i} \in \mathbb{R}^{d}\) is a feature vector and \(y_{i} \in \{-1, +1\}\) is its corresponding label. For convenience, the matrices \(A \in \mathbb{R}^{n_1 \times d}\) and \(B \in \mathbb{R}^{n_2 \times d}\) contain the samples of class \(+1\) and class \(-1\), respectively.
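To make the notation concrete, here is a minimal sketch (using made-up toy data, not any particular dataset) of how a labeled training set is split into the class matrices \(A\) and \(B\):

```python
import numpy as np

# Toy training set: feature matrix X (n x d) and labels y in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# A holds the n1 samples of class +1, B holds the n2 samples of class -1.
A = X[y == 1]
B = X[y == -1]
```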
The two non-parallel hyperplanes in TSVM are expressed as follows:
\[ {{x}^T}{{w}_{1}}+{{b}_{1}}=0 \quad \textrm{and} \quad {{x}^T}{{w}_{2}}+{{b}_{2}}=0\]
In order to obtain the above non-parallel hyperplanes, TSVM solves two primal optimization problems, which are expressed as follows:
\[\begin{equation}\begin{split}
\mathop{{ min}}\limits_{w_{1} ,b_{1}} \qquad & \frac{1}{2}{{\left\| A{{w}_{1}}+{{e}_{1}}{{b}_{1}} \right\|}^{2}}+{{c}_{1}}e_{2}^{T}\xi \\
\textrm{s.t. } \qquad & -(B{{w}_{1}}+{{e}_{2}}{{b}_{1}})+\xi\ge {{e}_{2}}\text{ },\xi\ge 0
\end{split}
\end{equation}\] \[\begin{equation}
\begin{split}
\mathop{{ min}}\limits_{w_{2} ,b_{2}} \qquad & \frac{1}{2}{{\left\| B{{w}_{2}}+{{e}_{2}}{{b}_{2}} \right\|}^{2}}+{{c}_{2}}e_{1}^{T}\eta \\
\textrm{s.t. } \qquad & (A{{w}_{2}}+{{e}_{1}}{{b}_{2}})+\eta\ge {{e}_{1}}\text{ },\eta\ge 0
\end{split}
\end{equation}\]
where \(c_{1}\) and \(c_{2}\) are positive hyper-parameters, \(\xi\) and \(\eta\) are non-negative slack vectors, and \(e_{1}\) and \(e_{2}\) are column vectors of ones of dimensions \(n_{1}\) and \(n_{2}\), respectively.
At first glance, the above two primal problems might seem complex. Admittedly, they are, but I will explain each term of the first optimization problem below so that you can make sense of them.
- The first term \({\left\| A{{w}_{1}}+{{e}_{1}}{{b}_{1}} \right\|}^{2}\) in the objective function measures the sum of squared distances of the positive samples from their own hyperplane. As I said earlier, the main idea is to keep the hyperplane as close as possible to the samples of its own class.
- The second term \({{c}_{1}}e_{2}^{T}\xi\) in the objective function allows some classification error. In real-world problems, samples are often not linearly separable, so the constraint of the optimization problem is violated for some samples. To address this, the non-negative slack vector \(\xi\) is added to the formulation, and the hyper-parameter \(c_{1}\) controls how heavily such violations are penalized.
- The constraint \(-(B{{w}_{1}}+{{e}_{2}}{{b}_{1}})+\xi\ge {{e}_{2}}\) states that the negative samples must lie at a distance of at least 1 from the positive-class hyperplane, up to the slack \(\xi\).
All in all, the general aim of the first optimization problem is to make the hyperplane of the positive class as close as possible to the samples of its own class while keeping it far away from the negative samples.
The second optimization problem is interpreted in the same way as the first one: the hyperplane of the negative class should be as close as possible to the samples of its own class and far away from the positive samples.
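To make the two objective terms and the constraint more concrete, the short sketch below plugs made-up hyperplane parameters (my own illustrative values, together with the toy matrices \(A\) and \(B\) from the earlier sketch) into the first optimization problem:

```python
import numpy as np

# Hypothetical candidate hyperplane for the positive class (illustration only).
w1_trial = np.array([0.5, -0.2])
b1_trial = 0.1
c1 = 1.0

# First term: sum of squared distances of positive samples from their own hyperplane.
fit_term = 0.5 * np.linalg.norm(A @ w1_trial + b1_trial) ** 2

# Constraint -(B w1 + e2 b1) + xi >= e2  =>  the slack needed for each negative sample
# is xi_i = max(0, 1 + (B w1 + b1)_i); a positive entry means the constraint is violated.
xi = np.maximum(0.0, 1.0 + (B @ w1_trial + b1_trial))

# Second term: penalty paid for those violations, weighted by c1.
penalty_term = c1 * xi.sum()

objective_value = fit_term + penalty_term
```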
To obtain the two non-parallel hyperplanes, the above two primal problems must be solved. However, as mentioned earlier, they are not trivial, since each one involves three groups of unknowns (the weight vector, the bias, and the slack vector). In practice, the dual formulations of these two problems are solved instead. Since this is a brief introduction, I have decided not to include the full details of how the solution is obtained.
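Just to give a feel for what solving the first primal problem involves, here is a minimal sketch that hands it directly to a general-purpose convex solver (CVXPY). This is only an illustration on the toy data above, not the dual-based procedure that practical implementations such as LightTwinSVM actually use:

```python
import cvxpy as cp

c1 = 1.0
n2, d = B.shape

w1 = cp.Variable(d)
b1 = cp.Variable()
xi = cp.Variable(n2, nonneg=True)   # slack vector for the negative-class constraints

# (1/2) ||A w1 + e1 b1||^2 + c1 e2^T xi
objective = cp.Minimize(0.5 * cp.sum_squares(A @ w1 + b1) + c1 * cp.sum(xi))

# -(B w1 + e2 b1) + xi >= e2   (xi >= 0 is already enforced via nonneg=True)
constraints = [-(B @ w1 + b1) + xi >= 1]

cp.Problem(objective, constraints).solve()
print("w1 =", w1.value, "b1 =", b1.value)
```

The second problem is handled in exactly the same way with the roles of \(A\) and \(B\) swapped.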
To sum up this post, it's essential to explain how an unseen sample is classified using the two non-parallel hyperplanes. A new sample is assigned to class \(+1\) or \(-1\) depending on which of the two hyperplanes it lies closer to, using the following decision function:
\[ D(x) = \underset{j=1,2}{\mathop{arg\min }}\,\,\frac{\left| {{x}^{T}}{{w}_{j}}+{{b}_{j}} \right|}{\left\| {{w}_{j}} \right\|}\]
where \(|\ .\ |\) denotes the absolute value, so each term is the perpendicular distance of sample \(x\) from the corresponding hyperplane (hyperplane 1 belongs to class \(+1\) and hyperplane 2 to class \(-1\)). Lastly, it is worth mentioning that TSVM can also handle non-linear problems by using a kernel function such as the Gaussian (RBF) kernel.
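For completeness, here is a hypothetical helper (the function name and arguments are my own, covering only the linear case) that applies this decision rule once both hyperplanes have been obtained:

```python
import numpy as np

def tsvm_predict(X, w1, b1, w2, b2):
    """Assign each row of X to class +1 or -1 based on the closer hyperplane."""
    # Perpendicular distance of each sample from each hyperplane.
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)
```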
Hope you enjoyed this post and learned about another classifier. If you are interested in using the TSVM classifier, you can check out the LightTwinSVM program, which is a fast and simple implementation of TSVM.
Let me know your thoughts and questions in the comments section below.