How to do a Support Vector Machine from scratch with Scikit-learn?

  1. How to load data?

1. Import Image from IPython.display, enable the %matplotlib inline magic, and import the datasets and numpy packages

from IPython.display import Image
%matplotlib inline
from sklearn import datasets
import numpy as np

2. Load a predefined dataset by writing datasets.load_datasetname(). Store the result in a variable

iris = datasets.load_iris()
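
The result of load_iris() is a dictionary-like Bunch object. As a quick optional check of what was loaded (these attribute names are the standard ones scikit-learn uses for its bundled datasets):

print(iris.data.shape)     # (150, 4): 150 samples, 4 feature columns
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # the three iris species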

2. How to identify data features and target?

  1. Columns 2 and 3 (petal length and petal width) are the data features we will use. Write variable.data and select all of its rows and just the columns at index 2 and 3
    iris.data[:, [2, 3]]
  2. Store result in a variable

X = iris.data[:, [2, 3]]

3. Write variable.target. This selects the entire target column
iris.target

4. Store result in a variable

y = iris.target
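
As a quick check of what y contains, np.unique lists the distinct class labels:

np.unique(y)  # array([0, 1, 2]): the three iris species encoded as integers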

3. How to identify training and testing data?

  1. Import train_test_split and cross_val_score
from sklearn.model_selection import train_test_split, cross_val_score  # in older scikit-learn these lived in sklearn.cross_validation

2. Write train_test_split function

train_test_split()

3. Use data features, target as argument

train_test_split(X, y)

4. Optionally, add other arguments such as test_size and random_state

train_test_split(X, y, test_size=0.33, random_state=42)

5. Store the function's output in four variables that represent the training data features, testing data features, training target and testing target (in that order)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
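
As a sanity check, the shapes should reflect test_size=0.33 (the exact counts below assume the 150-sample iris data loaded earlier):

print(X_train.shape, X_test.shape)  # (100, 2) (50, 2)
print(y_train.shape, y_test.shape)  # (100,) (50,)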

4. How to scale my data?

  1. Import the following
from sklearn.preprocessing import StandardScaler

2. Instantiate StandardScaler and store it in a variable

sc = StandardScaler()

3. Fit the scaler to the training data so it learns each feature's mean and standard deviation; thus, use the training data features as the argument
sc.fit(X_train)

4. Standardize the training and testing data features (subtract the learned mean and divide by the learned standard deviation) by using each of them as the argument of variable.transform()

sc.transform(X_train)
sc.transform(X_test)

5. Store both results in variables

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
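
To verify the scaling worked, each standardized training column should now have a mean of approximately 0 and a standard deviation of approximately 1 (the test columns will only be close, since they reuse the training statistics):

print(X_train_std.mean(axis=0))  # approximately [0. 0.]
print(X_train_std.std(axis=0))   # approximately [1. 1.]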

5. What are the functions behind Support Vector Machine?

  1. Import the following
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import warnings
  1. Split decimals function
  • Create a function that takes a string argument
def versiontuple(v):
  • Inside the function, split argument wherever it has a dot
def versiontuple(v):
    v.split(".")
  • Inside the function, turn the resulting strings into integers
def versiontuple(v):
    map(int, (v.split(".")))
  • Return the integers inside a tuple
def versiontuple(v):
    return tuple(map(int, (v.split("."))))
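
A quick usage example of the finished helper. Tuples compare element by element, which is exactly why we convert version strings this way instead of comparing them as plain strings:

versiontuple('1.9.0')   # (1, 9, 0)
(1, 10, 0) > (1, 9, 0)  # True, even though the string '1.10.0' sorts before '1.9.0'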

2. Plot decision regions and data function

  • Create a function that takes data features, target and classifier as arguments. Also set the test_idx parameter to None by default (so that no test samples are highlighted for now) and the resolution parameter to 0.02 (so that we have a relatively small grid step)

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

  • Create two tuples that will serve as our marker generator and color map. The first one will be a set of marker symbols and the second one a set of color-name strings.
    ('s', 'x', 'o', '^', 'v')
    ('red', 'blue', 'lightgreen', 'gray', 'cyan')
  • Store the tuples in two separate variables
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
  • Write the ListedColormap function to map each class to one of the colors

    ListedColormap()

  • As its argument, take a slice of colors from index 0 up to the number of target classes

    ListedColormap(colors[:len(np.unique(y))])

  • Store result in a variable

    cmap = ListedColormap(colors[:len(np.unique(y))])

  • Define the range minimum and maximum for the first column. For doing so, find its minimum row value minus 1 and its maximum row value plus 1. Remember to put a comma between minimum and maximum
    X[:, 0].min() - 1, X[:, 0].max() + 1
  • Store the result in two new variables
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
  • Do the same for the following column and print the results
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    print(x1_min, x1_max)
    print(x2_min, x2_max)

  • Turn each min/max pair into an array of numbers from the minimum to the maximum whose entries differ by a small step (resolution) and see the output
    np.arange(x1_min, x1_max, resolution),
    np.arange(x2_min, x2_max, resolution)

  • Make a grid (a pair of 2-D coordinate arrays) by using the arrays above as arguments of np.meshgrid()
     np.meshgrid(np.arange(x1_min, x1_max, resolution),
     np.arange(x2_min, x2_max, resolution))
  • Store result in two variables and see output
     xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
     np.arange(x2_min, x2_max, resolution))
     xx1, xx2

  • Flatten each grid array into a single 1-D array using .ravel() and make an array out of the results

     np.array([xx1.ravel(), xx2.ravel()])

  • Transpose the array, so that the number of columns becomes the number of rows and vice versa

     np.array([xx1.ravel(), xx2.ravel()]).T

  • Predict an outcome for each grid point using the classifier and store the result in Z. Classifier is a generic argument of the function; we will later pass a fitted logistic regression model in its place

     Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)

  • Reshape the predictions so that they have the same shape as the grid. See the output
     Z = Z.reshape(xx1.shape)
     Z

  • Create a filled contour plot of the two grid arrays (the plot's X and Y axes) and the reshaped Z (the predicted class at each grid point). To do this write plt.contourf and use the two grid arrays and reshaped Z as arguments

    plt.contourf(xx1, xx2, Z)

  • Set the alpha (transparency) parameter to some value and the cmap (color map) parameter to our cmap variable. See the output

    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)

  • Just in case you want to be extra sure that the axes are set to the desired boundaries, write the following
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
  • Loop through all the target classes

    for cl in np.unique(y):

  • Assign an index number to each class as you loop by using enumerate

    for idx, cl in enumerate(np.unique(y)):

  • Inside the loop, create a scatter plot

        plt.scatter()

  • Select column 0 of all the samples whose target equals the current class

        X[y == cl, 0]

  • Set the previous result as the x parameter of the scatter plot

        plt.scatter(x=X[y == cl, 0])

  • Do the same for column 1, but set its result as the y parameter of the scatter plot

        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1])

  • Set the c parameter to cmap(idx), so that each class gets its own color

        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], c=cmap(idx))

  • Optionally, set the alpha (transparency) parameter to 0.8, marker to markers[idx] and label to the class value cl
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
        alpha=0.8, c=cmap(idx),
        marker=markers[idx], label=cl)

  • Check which numpy version we are using

        np.__version__

  • Split the previous value with the split decimals function (step 1)

        versiontuple(np.__version__)

  • Do the same for the string '1.9.0'

        versiontuple('1.9.0')

  • Compare both version tuples in the following way to see whether our current version is at least 1.9.0

        if not versiontuple(np.__version__) >= versiontuple('1.9.0'):

  • If it is older, give a warning and index the test samples through list(), since older numpy versions cannot index an array with a range directly

            X_test, y_test = X[list(test_idx)], y[list(test_idx)]
            warnings.warn('Please update to NumPy 1.9.0 or newer')

  • Else, index directly. Notice we call test_idx the indices of the samples we choose as test data features and target, because we will pass them into the function we are writing
        else:
            X_test, y_test = X[test_idx], y[test_idx]
  • See the output with sample data features, target, a logistic regression classifier (the lr model built in the final step below) and a test_idx covering the 105th to 149th values of the data features and target
plot_decision_regions(X, y,
                      classifier=lr, test_idx=range(105, 150))

  • Notice that if we input the standardized data features instead of the raw data features, we get the standardized version of the plot
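
For reference, here are the fragments of this step assembled into one runnable function. The if test_idx: guard and the final scatter call that outlines the test points are additions of ours (the steps above extract X_test and y_test but never show them being plotted); everything else comes straight from the fragments:

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # a grid covering the feature ranges, padded by 1 on each side
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))

    # predict the class of every grid point and draw the colored regions
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # scatter the actual samples, one marker and color per class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight the test samples, if their indices were given (our addition)
    if test_idx:
        if not versiontuple(np.__version__) >= versiontuple('1.9.0'):
            X_test, y_test = X[list(test_idx)], y[list(test_idx)]
            warnings.warn('Please update to NumPy 1.9.0 or newer')
        else:
            X_test, y_test = X[test_idx], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1], facecolors='none',
                    edgecolors='black', alpha=1.0, linewidths=1,
                    marker='o', s=55, label='test set')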

3. Sigmoid function

  • Import pyplot and numpy
import matplotlib.pyplot as plt
import numpy as np
  • Create a function that takes an argument

def sigmoid(z):

  • Return 1.0 over (1.0 plus e^-z), computed elementwise over the input array of numbers
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
  • Create the array of numbers and store it in variable

z = np.arange(-7, 7, 0.1)

  • Write function in terms of previous variable. Store function in new variable

phi_z = sigmoid(z)

  • Plot array of numbers vs function. You may need to add plt.show() at the end of every update

plt.plot(z, phi_z)

  • Plot x=0 (vertical middle line)

plt.axvline(0.0, color='k')

  • Define visual limit below y=0 and above y=1 for better visualization

plt.ylim(-0.1, 1.1)

  • Label x axis as z and y axis as phi(z)

plt.xlabel('z')

plt.ylabel('$\phi (z)$')

  • If you just want to see relevant numbers in the y-axis, choose the ticks

plt.yticks([0.0, 0.5, 1.0])

  • If you want dotted horizontal lines passing by your ticks, write the following
ax = plt.gca()
ax.yaxis.grid(True)

  • If we want to make the graph bigger and more visually proportional, write this

plt.tight_layout()

  • To save your picture, write the path where you want to save it, /, and picture_name.png. You can set the dpi (dots per inch) parameter to 300
plt.savefig('/Users/omaraguilar/DSI_SM_01/curriculum/week-05/sigmoid.png', dpi=300)
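
Before moving on, a quick sanity check that the function behaves as expected: the sigmoid maps 0 to exactly 0.5 and squashes large positive and negative inputs toward 1 and 0 respectively:

sigmoid(0)                    # 0.5
sigmoid(np.array([-7., 7.]))  # approximately [0.0009, 0.9991]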

4. Cost functions for y=1 and y=0. (These are the two terms of the logistic regression cost function, not the bias-variance tradeoff.)

  • Create a function that takes an argument

def cost_1(z):

  • Return the negative logarithm of the sigmoid function (step 3)

    return - np.log(sigmoid(z))

  • Create another function that takes an argument

def cost_0(z):

  • Return the negative logarithm of 1 minus the sigmoid function

    return - np.log(1 - sigmoid(z))
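
To see why these two costs make sense, evaluate them at a few points: an uncertain prediction (z=0, so phi(z)=0.5) gets a moderate penalty for either class, while a confidently wrong prediction is penalized heavily:

cost_1(0)    # -log(0.5), approximately 0.693
cost_0(0)    # same, approximately 0.693
cost_1(-10)  # approximately 10: confidently predicting y=0 when y=1 is costly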

5. Join the sigmoid, cost_0 and cost_1 functions

  • Define z as an array with numbers from -10 to 10 that differ by 0.1

z = np.arange(-10, 10, 0.1)

  • Store sigmoid function in a variable

phi_z = sigmoid(z)

  • Loop through the mentioned array and collect the values of cost_1(x) into a list

c1 = [cost_1(x) for x in z]

  • Plot the values of the sigmoid function vs. the previous list values

plt.plot(phi_z, c1)

  • Set label parameter to ‘J(w) if y=1’

plt.plot(phi_z, c1, label='J(w) if y=1')

  • Loop through the mentioned array and collect the values of cost_0(x) into a list

c0 = [cost_0(x) for x in z]

  • Plot the values of the sigmoid function vs. the previous list values. You may need to write plt.show() at the end of each update

plt.plot(phi_z, c0)

  • Set label parameter to ‘J(w) if y=0’

plt.plot(phi_z, c0, label='J(w) if y=0')

  • Define the vertical visual limits from y=0 to just above y=5 for better visualization

plt.ylim(0.0, 5.1)

  • Define the horizontal visual limits from x=0 to x=1 for better visualization

plt.xlim([0, 1])

  • Label x axis as phi(z) and y axis as J(w)
plt.xlabel('$\phi$(z)')
plt.ylabel('J(w)')
  • To create a legend write the following

plt.legend(loc='best')

  • If we want to make the graph bigger and more visually proportional, write this

plt.tight_layout()

  • To save your picture, write the path where you want to save it, /, and picture_name.png. You can set the dpi (dots per inch) parameter to 300
plt.savefig('/Users/omaraguilar/DSI_SM_01/curriculum/week-05/log_cost.png', dpi=300)
  • See output
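
For reference, the whole cost plot gathered from the steps above into one runnable block (the plt.show() call at the end is optional in a notebook with %matplotlib inline):

z = np.arange(-10, 10, 0.1)
phi_z = sigmoid(z)
c1 = [cost_1(x) for x in z]
plt.plot(phi_z, c1, label='J(w) if y=1')
c0 = [cost_0(x) for x in z]
plt.plot(phi_z, c0, label='J(w) if y=0')
plt.ylim(0.0, 5.1)
plt.xlim([0, 1])
plt.xlabel('$\phi$(z)')
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
plt.show()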

6. Improve the plot decision regions and data function

  • Import Logistic Regression from the scikit-learn package

from sklearn.linear_model import LogisticRegression

  • Use the scaled training and testing data features as arguments of np.vstack to create a single array with the two arrays stacked vertically inside

np.vstack((X_train_std, X_test_std))

  • Store array in a variable

X_combined_std = np.vstack((X_train_std, X_test_std))

  • Do the same for the non-scaled training and testing targets, stacking them horizontally with np.hstack

y_combined = np.hstack((y_train, y_test))

  • Set LogisticRegression's inverse regularization strength parameter (C) to a high value, which means weak regularization and a tighter fit to the training data

LogisticRegression(C=1000.0)

  • Set LogisticRegression's random_state parameter to 0, so that the results are reproducible

LogisticRegression(C=1000.0, random_state=0)

  • Store the classifier in a variable for easier manipulation

lr = LogisticRegression(C=1000.0, random_state=0)

  • Fit function to scaled training data and training target to create a model

lr.fit(X_train_std, y_train)

  • Use the two combined variables we defined at the beginning of this step as arguments of the plot decision regions function

plot_decision_regions(X_combined_std, y_combined)

  • Also set the classifier parameter to the lr variable and the test_idx parameter to the indices of the samples we chose as test data
plot_decision_regions(X_combined_std, y_combined,
                      classifier=lr, test_idx=range(105, 150))
  • Label x axis as scaled petal length and y axis as scaled petal width
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
  • To create a legend write the following

plt.legend(loc='upper left')

  • If we want to make the graph bigger and more visually proportional, write this

plt.tight_layout()

  • To save your picture, write the path where you want to save it, /, and picture_name.png. You can set the dpi (dots per inch) parameter to 300
plt.savefig('/Users/omaraguilar/DSI_SM_01/curriculum/week-05/log_cost.png', dpi=300)
  • See output
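
Finally, to connect back to the title: an actual Support Vector Machine plugs into this exact pipeline. A minimal sketch, assuming the X_train_std, y_train, combined arrays and plotting function built above (SVC is scikit-learn's support vector classifier; the linear kernel and C=1.0 here are just illustrative choices):

from sklearn.svm import SVC

svm = SVC(kernel='linear', C=1.0, random_state=0)  # linear-kernel SVM
svm.fit(X_train_std, y_train)                      # train on the scaled features

plot_decision_regions(X_combined_std, y_combined,
                      classifier=svm, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.show()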
