Support Vector Machine

What is a Support Vector Machine model?

  • Binary linear classifier explicitly constructed to minimize generalization error (i.e., error on future predictions due to variance)

How to create Support Vector Machine from scratch using math?

  1. Use the following geometric concepts instead of the mainstream algebraic concepts

Problem space: Space in which all the data points are distributed

Decision boundary: Region that separates the problem space into two parts (the boundary itself belongs to neither part)

Margin: Width of the decision boundary region, i.e., the distance from the boundary to the nearest data points on either side

2. Apply the Support Vector Machine to find the decision boundary with the largest margin, i.e., the maximum-margin hyperplane (MMH)

3. To do so, use the discriminant function to derive the equations you need

The discriminant function is f(x) = wᵀx + b, where w is the weight vector and b is the bias. The sign of f(x) determines the (binary) class label of a record x.
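As a minimal sketch of how the discriminant function assigns labels (the weight vector and bias below are hypothetical illustrative values, not learned by an SVM):

```python
import numpy as np

# Hypothetical weight vector and bias -- illustrative values, not fitted
w = np.array([2.0, -1.0])
b = 0.5

def discriminant(x):
    """Compute f(x) = w . x + b."""
    return np.dot(w, x) + b

def predict(x):
    """The sign of f(x) gives the binary class label (+1 or -1)."""
    return 1 if discriminant(x) >= 0 else -1

print(predict(np.array([1.0, 1.0])))   # f = 2 - 1 + 0.5 = 1.5, so class +1
print(predict(np.array([-1.0, 1.0])))  # f = -2 - 1 + 0.5 = -2.5, so class -1
```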

4. Here are the derived equations you need. Notice that the margin depends on only a few training data points near the decision boundary (the support vectors)
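For reference, a sketch of the standard hard-margin formulation (the textbook version, stated here since the original figures are not reproduced):

```latex
\begin{aligned}
& \min_{\mathbf{w},\, b} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 \\
& \text{subject to} \quad y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1,
  \qquad i = 1, \dots, n
\end{aligned}
```

The resulting margin width is 2/‖w‖, and only the points where the constraint is tight, i.e., y_i(wᵀx_i + b) = 1, actually pin down the solution; these are the support vectors.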

5. Determine whether we can find a margin that linearly separates the different classes or not

If we can, the step above suffices; if we can’t, go to the next step

6. Here are the derived equations you need for the non-separable (soft-margin) case
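As a reference sketch, the standard soft-margin formulation adds slack variables ξᵢ that allow some training errors, penalized through C:

```latex
\begin{aligned}
& \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
  + C \sum_{i=1}^{n} \xi_i \\
& \text{subject to} \quad y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i,
  \qquad \xi_i \ge 0
\end{aligned}
```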

where the hyper-parameter C (the soft-margin constant) controls overall complexity by specifying the penalty for training errors. Notice how the margin measures barely 0.2 when C=100, but measures 0.6 when C=10

 

What if the region has a non-linear shape? In other words, how do we come up with formulas for higher-dimensional spaces?

  • The previous equations can be rewritten in the following (dual) form

  • To change the shape of the region, just replace the inner product with a more general function (i.e., a kernel) using the kernel trick. Here is a list of kernels.
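A reference sketch of the standard dual form and the most common kernels (standard formulations, stated here since the original figures are not reproduced):

```latex
% Dual problem: the data enter only through inner products
\max_{\boldsymbol{\alpha}} \ \sum_i \alpha_i
  - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j
    \langle \mathbf{x}_i, \mathbf{x}_j \rangle
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \ \ \sum_i \alpha_i y_i = 0

% Common kernels obtained by replacing the inner product:
K(\mathbf{x}, \mathbf{x}') = \langle \mathbf{x}, \mathbf{x}' \rangle
  % linear
K(\mathbf{x}, \mathbf{x}') = (\langle \mathbf{x}, \mathbf{x}' \rangle + 1)^d
  % polynomial of degree d
K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2)
  % RBF (Gaussian) with width parameter \gamma
```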

where hyperparameters d and γ affect the flexibility of the decision boundary.

What is a more detailed explanation of the non-linear SVM?

  • Nonlinear applications of SVM rely on an implicit (nonlinear) mapping that sends vectors from the original feature space K into a higher-dimensional feature space K’. Nonlinear classification in K is then obtained by creating a linear decision boundary in K’. In practice, this involves no computations in the higher dimensional space, thanks to what is called the kernel trick.
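A short sketch of this in practice (assuming scikit-learn’s SVC is available, on a toy XOR-style dataset that is not linearly separable in the original space):

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: no single line separates the two classes in 2-D
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# An RBF kernel implicitly maps the points into a higher-dimensional
# space where a linear boundary exists; the mapping itself is never
# computed -- only kernel values between pairs of points are.
clf = SVC(kernel="rbf", C=10.0, gamma=1.0)
clf.fit(X, y)
print(clf.predict(X))  # all four training points classified correctly
```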

What are the pros and cons of Support Vector Machine?

Pros:

  • Very powerful, with strong predictive performance
  • Can be used for anomaly detection (one-class SVM)

Cons:

  • Training can become very slow with lots of data
  • Prone to overfitting (needs regularization)
  • Black box (predictions are hard to interpret)

How to tune Support Vector Machine?

  • Employ cross-validation to select the kernel and hyper-parameters (e.g., C, d, γ)
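A minimal sketch of cross-validated tuning (assuming scikit-learn; the grid values and synthetic dataset are arbitrary examples):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data as a stand-in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Candidate values for C (soft-margin constant) and gamma (RBF width)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}

# 5-fold cross-validation picks the combination with the best
# held-out accuracy
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```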

How exactly (mathematically) does the soft-margin constant penalize training errors?

What do “minimize” and “subject to” mean?
