How to do Support Vector Machine from scratch with Scikit-learn?

  1. How to load data?

1. Import display, %matplotlib inline, datasets and numpy packages

from IPython.display import Image
%matplotlib inline
from sklearn import datasets
import numpy as np

2. Load predefined data set by writing datasets.load_datasetname(). Store result in a variable

iris = datasets.load_iris()

2. How to identify data features and target?

  1. The columns at index 2 and 3 of iris.data (petal length and petal width) are the data features. Write iris.data[:, [2, 3]] to select all rows and only those two columns
  2. Store result in a variable

X = iris.data[:, [2, 3]]

3. Write iris.target. This selects the entire target column

4. Store result in a variable

y = iris.target
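As a quick check on the two selections above, here is a minimal sketch that loads the data and prints what was selected. It assumes the features come from iris.data and the target from iris.target; the printed names are read from the metadata bundled with the data set.

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()

# Columns at index 2 and 3 hold petal length and petal width
X = iris.data[:, [2, 3]]
y = iris.target

print(iris.feature_names[2], iris.feature_names[3])
print(X.shape)        # one row per flower, two feature columns
print(np.unique(y))   # the distinct class labels
```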

3. How to identify training and testing data?

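  1. Import train_test_split and cross_val_score. In scikit-learn 0.18 and later they live in sklearn.model_selection; the older sklearn.cross_validation module was removed in 0.20
from sklearn.model_selection import train_test_split, cross_val_score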

2. Write train_test_split function


3. Use data features, target as argument

train_test_split(X, y)

4. Optionally, add other arguments such as test_size and random_state

train_test_split(X, y, test_size=0.33, random_state=42)

5. Store the returned arrays in variables that represent training data features, testing data features, training target and testing target (in that order)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
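The split above can be checked with a short sketch like the one below. It assumes a scikit-learn version that provides sklearn.model_selection and reuses the same test_size and random_state; the second call exists only to show that a fixed random_state reproduces the split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# test_size=0.33 on 150 rows: the test share is rounded up
print(X_train.shape, X_test.shape)

# The same random_state gives the same split on every call
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.33, random_state=42)
print((X_test == X_test2).all())
```

With 150 rows, this leaves 100 rows for training and 50 for testing.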

4. How to scale my data?

  1. Import the following
from sklearn.preprocessing import StandardScaler

2. Write StandardScaler function and store it in a variable

sc = StandardScaler()

3. Fit the scaler (variable) to the training data so that it learns the mean and standard deviation of each feature; thus, use the training data features as argument

sc.fit(X_train)

4. Standardize the training and testing data features by passing each of them to sc.transform(). Both are scaled with the mean and standard deviation learned from the training data

5. Store both results in variables

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
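To see what fit and transform actually compute, here is a small sketch on toy data (the numbers are hypothetical, not taken from the Iris set). It compares the scaler's output with the formula (x - mean) / std, where mean and std come from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not the Iris set)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0], [5.0, 50.0]])

sc = StandardScaler()
sc.fit(X_train)                       # learns mean_ and scale_ per column
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# The transform is (x - mean) / std, using the training statistics
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_std, manual))
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

Reusing the training statistics for the test rows keeps information from the test set out of the preprocessing step.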

5. What are the functions behind Support Vector Machine?

  1. Import the following
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import warnings
  1. Split decimals function
  • Create a function that takes a string argument
def versiontuple(v):
  • Inside the function, split argument wherever it has a dot
def versiontuple(v):
    v.split(".")
  • Inside the function, turn the resulting strings into integers
def versiontuple(v):
    map(int, (v.split(".")))
  • Return the integers inside a tuple
def versiontuple(v):
    return tuple(map(int, (v.split("."))))
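A short sketch of why the version string is turned into a tuple of integers: tuples compare element by element, while plain strings compare character by character. The version strings below are illustrative literals. Note that a version with a non-numeric suffix (for example a release candidate) would make int() raise a ValueError, so this helper only suits purely numeric versions.

```python
def versiontuple(v):
    return tuple(map(int, v.split(".")))

print(versiontuple("1.9.0"))
# Tuples compare element by element, so 1.10.0 counts as newer than 1.9.0
print(versiontuple("1.10.0") >= versiontuple("1.9.0"))
# Plain strings compare character by character and get this wrong
print("1.10.0" >= "1.9.0")
```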

2. Plot decision regions and data function

  • Create a function that takes data features, target and classifier as arguments. Also set the test_idx parameter to None (so that no test rows are selected by default) and the resolution parameter to 0.02 (so that the grid has a relatively small step)

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

  • Create two tuples that will represent our marker generator and color map. The first one will be a set of marker symbols and the second one will be a set of color names.
    ('s', 'x', 'o', '^', 'v')
    ('red', 'blue', 'lightgreen', 'gray', 'cyan')
  • Store the resulting tuples in two separate variables
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
  • Write the ListedColormap function to build a color map from a sequence of colors


  • As an argument, take the first n colors, where n is the number of target classes


  • Store result in a variable

    cmap = ListedColormap(colors[:len(np.unique(y))])

  • Define the range minimum and maximum for the first column. To do so, find its minimum value minus 1 and its maximum value plus 1. Remember to put a comma between minimum and maximum
    X[:, 0].min() - 1, X[:, 0].max() + 1
  • Do the same for the second column
    X[:, 1].min() - 1, X[:, 1].max() + 1
  • Store each result in two new variables and print results
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    print(x1_min, x1_max)
    print(x2_min, x2_max)

  • Turn each range into an array of numbers from the minimum to the maximum, spaced by a small step (resolution), and see output
    np.arange(x1_min, x1_max, resolution),
    np.arange(x2_min, x2_max, resolution)

  • Make a grid (a pair of 2-D coordinate arrays) by using the arrays above as arguments of np.meshgrid()
    np.meshgrid(np.arange(x1_min, x1_max, resolution),
                np.arange(x2_min, x2_max, resolution))
  • Store result in two variables and see output
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    xx1, xx2

  • Flatten each grid (so that all its rows become a single 1-D array) using .ravel() and make an array with the results

    np.array([xx1.ravel(), xx2.ravel()])

  • Transpose the array, so that rows become columns and vice versa. Each row is now one grid point with two feature values

    np.array([xx1.ravel(), xx2.ravel()]).T

  • Predict the outcome for every grid point using the classifier. Classifier is a general argument we give to the function, but we will later pass a Logistic Regression model in its place

    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)

  • Reshape the predictions so that they have the same shape as the grid. See output
    Z = Z.reshape(xx1.shape)

  • Create a filled contour plot of the two grid arrays (that represent the map’s X and Y axes) and reshaped Z (that determines the color of each region). To do this write plt.contourf and use the two grid arrays and reshaped Z as arguments

    plt.contourf(xx1, xx2, Z)

  • Set alpha (transparency) parameter to some value and cmap (type of color map) parameter to cmap. See output

    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)

  • Just in case you want to be extra sure that you are setting your axis to desired boundaries, write the following
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
  • Loop through all the target types

    for cl in np.unique(y):

  • Assign a number for each type as you loop

    for idx, cl in enumerate(np.unique(y)):

  • Inside loop, create a Scatter Plot


  • Select the values in column 0 of the data features for the rows whose target equals the current class

        X[y == cl, 0]

  • Set previous result to the x parameter of the scatter plot

        plt.scatter(x=X[y == cl, 0])

  • Do the same for column 1, but set its result to the y parameter of the scatter plot

        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1])

  • Set the c parameter to cmap(idx), where idx is the number assigned to the current class, so that we get three different colors for our data

        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], c=cmap(idx))

  • Optionally, set the alpha (transparency) parameter to 0.8, marker to markers[idx] and label to the current class
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

  • Write numpy version we are using

        np.__version__

  • Split previous value with split decimals function (step 1)

        versiontuple(np.__version__)

  • Do the same for string '1.9.0'

        versiontuple('1.9.0')

  • Compare both versions in the following way to see if our current version is older than 1.9.0

        if not versiontuple(np.__version__) >= versiontuple('1.9.0'):

  • If it is older, give a warning and still index the rows listed in test_idx from the X and y arrays. These would be our testing data features and target. test_idx holds the positions of the rows we choose, and we pass it in when we call the function

            warnings.warn('Please update to NumPy 1.9.0 or newer')
            X_test, y_test = X[test_idx], y[test_idx]

  • Else, index directly
        else:
            X_test, y_test = X[test_idx], y[test_idx]
  • See output with sample data features, target, a Logistic Regression classifier (lr, created in step 5) and a test_idx that takes rows 105 to 149 of the data features and target
plot_decision_regions(X, y,
                      classifier=lr, test_idx=range(105, 150))

  • Notice that passing the standardized data features instead of the raw data features changes the plot
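The bullets above can be assembled into one function roughly as follows. This is a sketch under several assumptions, not the definitive version: it renders off-screen with the Agg backend, omits the NumPy version check (assuming NumPy 1.9.0 or newer), uses color= rather than c= for a single RGBA color, circles the test rows (the text stops after selecting them), and returns Z so the result can be inspected. ThresholdClassifier and the random demo data are hypothetical stand-ins for a real model and data set.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap


def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # Grid covering the data range, padded by 1 on each side
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))

    # One prediction per grid point, reshaped back to the grid
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # One scatter call per class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, color=cmap(idx),
                    marker=markers[idx], label=cl)

    # Assumption: circle the test rows; the text stops after selecting them
    if test_idx is not None:
        X_test = X[list(test_idx)]
        plt.scatter(X_test[:, 0], X_test[:, 1], facecolors='none',
                    edgecolors='black', marker='o', s=55, label='test set')
    return Z  # returned only so the predictions can be inspected


class ThresholdClassifier:
    """Hypothetical stand-in: predicts class 1 when the first feature > 0."""
    def predict(self, X):
        return (X[:, 0] > 0).astype(int)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)
Z = plot_decision_regions(X_demo, y_demo, ThresholdClassifier(),
                          test_idx=range(30, 40), resolution=0.1)
print(Z.shape)
```

Any object with a predict method that accepts a two-column array can be passed as classifier, which is why a fitted scikit-learn model works here as well.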

3. Sigmoid function

  • Import pyplot and numpy
import matplotlib.pyplot as plt
import numpy as np
  • Create a function that takes an argument

def sigmoid(z):

  • Return 1.0 divided by (1.0 plus e^-z), computed element-wise over the array of numbers
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
  • Create the array of numbers and store it in variable

z = np.arange(-7, 7, 0.1)

  • Write function in terms of previous variable. Store function in new variable

phi_z = sigmoid(z)

  • Plot array of numbers vs function. You may need to add plt.show() at the end of every update

plt.plot(z, phi_z)

  • Plot x=0 (vertical middle line)

plt.axvline(0.0, color='k')

  • Define visual limit below y=0 and above y=1 for better visualization

plt.ylim(-0.1, 1.1)

  • Label x axis as z and y axis as phi(z)

plt.xlabel('z')
plt.ylabel(r'$\phi (z)$')

  • If you just want to see relevant numbers in the y-axis, choose the ticks

plt.yticks([0.0, 0.5, 1.0])

  • If you want horizontal grid lines passing through your ticks, write the following
ax = plt.gca()
ax.yaxis.grid(True)

  • To adjust the padding so that the labels fit inside the figure, write this

plt.tight_layout()

  • To save your picture, write the path where you want to save it, / and picture_name.png. You can set dpi (dots per inch) parameter to 300
plt.savefig('/Users/omaraguilar/DSI_SM_01/curriculum/week-05/sigmoid.png', dpi=300)
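Besides plotting it, a few properties of the sigmoid can be checked numerically. The sketch below reuses the sigmoid function and the z array from the steps above; the checks themselves are illustrative additions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.arange(-7, 7, 0.1)
phi_z = sigmoid(z)

print(sigmoid(0.0))                      # midpoint of the curve
print(phi_z.min() > 0, phi_z.max() < 1)  # stays strictly inside (0, 1)
# Symmetry around z = 0: sigmoid(z) + sigmoid(-z) = 1
print(np.allclose(sigmoid(z) + sigmoid(-z), 1.0))
```

This is why the y-axis limits are set slightly below 0 and slightly above 1: the curve approaches both values without reaching them.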

4. Cost 0 and 1 functions. These are the logistic regression costs of a single sample when y=1 and when y=0

  • Create a function that takes an argument

def cost_1(z):

  • Return the negative logarithm of the sigmoid function (step 3)

    return - np.log(sigmoid(z))

  • Create another function that takes an argument

def cost_0(z):

  • Return the negative logarithm of 1 minus the sigmoid function

    return - np.log(1 - sigmoid(z))
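A few sample values make the behavior of the two cost functions easier to see. This sketch redefines sigmoid, cost_1 and cost_0 as above so that it runs on its own; the three z values are arbitrary illustrations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_1(z):
    return -np.log(sigmoid(z))

def cost_0(z):
    return -np.log(1 - sigmoid(z))

z = np.array([-4.0, 0.0, 4.0])
print(cost_1(z))  # shrinks as z grows: a confident, correct prediction when y=1
print(cost_0(z))  # grows as z grows: a confident, wrong prediction when y=0
# At z = 0 the sigmoid outputs 0.5 and both costs equal log(2)
print(np.isclose(cost_1(0.0), np.log(2)), np.isclose(cost_0(0.0), np.log(2)))
# The two curves mirror each other
print(np.allclose(cost_0(z), cost_1(-z)))
```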

4. Join sigmoid, cost_0 and cost_1 functions

  • Define z as an array with numbers from -10 to 10 that differ by 0.1

z = np.arange(-10, 10, 0.1)

  • Store sigmoid function in a variable

phi_z = sigmoid(z)

  • Loop through mentioned array, and append the values of cost_1(x) into a list

c1 = [cost_1(x) for x in z]

  • Plot the values of sigmoid function vs. previous list values

plt.plot(phi_z, c1)

  • Set label parameter to ‘J(w) if y=1’

plt.plot(phi_z, c1, label='J(w) if y=1')

  • Loop through mentioned array, and append the values of cost_0(x) into a list

c0 = [cost_0(x) for x in z]

  • Plot the values of sigmoid function vs. previous list values. You may need to write plt.show() at the end of each update

plt.plot(phi_z, c0)

  • Set label parameter to ‘J(w) if y=0’

plt.plot(phi_z, c0, label='J(w) if y=0')

  • Set the vertical visual limits from 0.0 to 5.1 for better visualization

plt.ylim(0.0, 5.1)

  • Set the horizontal visual limits from 0 to 1, the range of the sigmoid function

plt.xlim([0, 1])

  • Label x axis as phi(z) and y axis as J(w)

plt.xlabel(r'$\phi(z)$')
plt.ylabel('J(w)')

  • To create a legend write the following

plt.legend()

  • To adjust the padding so that the labels fit inside the figure, write this

plt.tight_layout()

  • To save your picture, write the path where you want to save it, / and picture_name.png. You can set dpi (dots per inch) parameter to 300
plt.savefig('/Users/omaraguilar/DSI_SM_01/curriculum/week-05/log_cost.png', dpi=300)
  • See output
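Two things in the steps above can be sketched in code. First, cost_1 and cost_0 already work on whole arrays, so the list comprehension and a direct call give the same values. Second, the two curves can be combined into a single per-sample cost; the cost helper below is a hypothetical illustration that is not part of the steps above, and the sample values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_1(z):
    return -np.log(sigmoid(z))

def cost_0(z):
    return -np.log(1 - sigmoid(z))

def cost(z, y):
    """Hypothetical helper: picks cost_1 where y is 1 and cost_0 where y is 0."""
    return y * cost_1(z) + (1 - y) * cost_0(z)

z = np.arange(-10, 10, 0.1)

# The list comprehension and the vectorized call give the same values
c1 = [cost_1(x) for x in z]
print(np.allclose(c1, cost_1(z)))

# Per-sample cost for three samples with labels 1, 0, 1
z_samples = np.array([2.0, 2.0, -2.0])
y_samples = np.array([1, 0, 1])
result = cost(z_samples, y_samples)
print(result)
```

The first sample is predicted correctly and gets a small cost; the other two are predicted wrongly with the same confidence and get the same larger cost.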

5. Improve Plot Decisions and data function

  • Import Logistic Regression from scikit learn package

from sklearn.linear_model import LogisticRegression

  • Use the training and testing scaled data features as arguments of np.vstack to create a single array with the two arrays stacked vertically

np.vstack((X_train_std, X_test_std))

  • Store array in a variable

X_combined_std = np.vstack((X_train_std, X_test_std))

  • Do the same for the training and testing target (which is not scaled), using np.hstack because the targets are one-dimensional

y_combined = np.hstack((y_train, y_test))

  • Set LogisticRegression’s inverse regularization strength parameter (C) to a high value, which means weak regularization, so that the model fits the training data closely


  • Set LogisticRegression’s random_state parameter to 0, so that results are reproducible

LogisticRegression(C=1000.0, random_state=0)

  • Store the model in a variable for later use

lr = LogisticRegression(C=1000.0, random_state=0)

  • Fit the model to the scaled training data and training target

lr.fit(X_train_std, y_train)

  • Use the two variables we defined at the beginning of step 5 as arguments of the plot decision regions and data function

plot_decision_regions(X_combined_std, y_combined)

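  • Also set classifier parameter to the model variable and test_idx parameter to the positions of the test rows in the combined arrays. Because the training rows are stacked first, the test rows come last; range(105, 150) corresponds to a split with 105 training and 45 test rows, whereas test_size=0.33 gives 100 and 50
plot_decision_regions(X_combined_std, y_combined,
                      classifier=lr, test_idx=range(105, 150))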
  • Label x axis as scaled petal length and y axis as scaled petal width
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
  • To create a legend write the following

plt.legend(loc='upper left')

  • To adjust the padding so that the labels fit inside the figure, write this

plt.tight_layout()

  • To save your picture, write the path where you want to save it, / and picture_name.png. You can set dpi (dots per inch) parameter to 300
plt.savefig('/Users/omaraguilar/DSI_SM_01/curriculum/week-05/log_cost.png', dpi=300)
  • See output
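For reference, the data preparation and model fitting from the steps above can be run end to end without the plot. This is a sketch assuming a scikit-learn version with sklearn.model_selection; deriving test_idx from the size of the training set, and scoring the model with lr.score, are illustrative additions rather than steps from the text.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_train_std, y_train)

X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))

# Training rows come first, so the test rows sit at the end
n_train = len(X_train_std)
test_idx = range(n_train, len(X_combined_std))
print(X_combined_std.shape, y_combined.shape, test_idx)

# Share of held-out rows classified correctly
print(lr.score(X_test_std, y_test))
```

Computing test_idx from the split sizes keeps the highlighted rows in step with whatever test_size is used.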
