How to do a Support Vector Machine from scratch with Scikit-learn?

  1. How to load data?

1. Import Image from IPython.display, enable the %matplotlib inline magic, and import the datasets and numpy packages

from IPython.display import Image
%matplotlib inline
from sklearn import datasets
import numpy as np

2. Load a predefined dataset by writing datasets.load_datasetname(). Store the result in a variable

iris = datasets.load_iris()
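
The result of load_iris() is a dictionary-like Bunch object. As a quick optional check of what was loaded (these attribute names are the standard ones scikit-learn uses for its bundled datasets):

print(iris.data.shape)     # (150, 4): 150 samples, 4 feature columns
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # the three iris species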

2. How to identify data features and target?

  1. Columns 2 and 3 (petal length and petal width) are the data features we will use. Write variable.data and select all of its rows and just the columns at index 2 and 3
    iris.data[:, [2, 3]]
  2. Store result in a variable

X = iris.data[:, [2, 3]]

3. Write variable.target. This selects the entire target column
iris.target

4. Store result in a variable

y = iris.target
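
As a quick check of what y contains, np.unique lists the distinct class labels:

np.unique(y)  # array([0, 1, 2]): the three iris species encoded as integers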

3. How to identify training and testing data?

  1. Import train_test_split and cross_val_score
from sklearn.model_selection import train_test_split, cross_val_score  # in older scikit-learn these lived in sklearn.cross_validation

2. Write train_test_split function

train_test_split()

3. Use data features, target as argument

train_test_split(X, y)

4. Optionally, add other arguments such as test_size and random_state

train_test_split(X, y, test_size=0.33, random_state=42)

5. Store the function's output in four variables that represent the training data features, testing data features, training target and testing target (in that order)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
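
As a sanity check, the shapes should reflect test_size=0.33 (the exact counts below assume the 150-sample iris data loaded earlier):

print(X_train.shape, X_test.shape)  # (100, 2) (50, 2)
print(y_train.shape, y_test.shape)  # (100,) (50,)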

4. How to scale my data?

  1. Import the following
from sklearn.preprocessing import StandardScaler

2. Instantiate StandardScaler and store it in a variable

sc = StandardScaler()

3. Fit the scaler to the training data so it learns each feature's mean and standard deviation; thus, use the training data features as the argument
sc.fit(X_train)

4. Standardize the training and testing data features (subtract the learned mean and divide by the learned standard deviation) by using each of them as the argument of variable.transform()

sc.transform(X_train)
sc.transform(X_test)

5. Store both results in variables

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
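
To verify the scaling worked, each standardized training column should now have a mean of approximately 0 and a standard deviation of approximately 1 (the test columns will only be close, since they reuse the training statistics):

print(X_train_std.mean(axis=0))  # approximately [0. 0.]
print(X_train_std.std(axis=0))   # approximately [1. 1.]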

5. What are the functions behind Support Vector Machine?

  1. Import the following
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import warnings
  1. Split decimals function
  • Create a function that takes a string argument
def versiontuple(v):
  • Inside the function, split argument wherever it has a dot
def versiontuple(v):
    v.split(".")
  • Inside the function, turn the resulting strings into integers
def versiontuple(v):
    map(int, (v.split(".")))
  • Return the integers inside a tuple
def versiontuple(v):
    return tuple(map(int, (v.split("."))))
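
A quick usage example of the finished helper. Tuples compare element by element, which is exactly why we convert version strings this way instead of comparing them as plain strings:

versiontuple('1.9.0')   # (1, 9, 0)
(1, 10, 0) > (1, 9, 0)  # True, even though the string '1.10.0' sorts before '1.9.0'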

2. Plot decision regions and data function

  • Create a function that takes data features, target and classifier as arguments. Also set the test_idx parameter to None by default (so that no test samples are highlighted for now) and the resolution parameter to 0.02 (so that we have a relatively small grid step)

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

  • Create two tuples that will serve as our marker generator and color map. The first one will be a set of marker symbols and the second one a set of color-name strings.
    ('s', 'x', 'o', '^', 'v')
    ('red', 'blue', 'lightgreen', 'gray', 'cyan')
  • Store the tuples in two separate variables
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
  • Write the ListedColormap function to map each class to one of the colors

    ListedColormap()

  • As its argument, take a slice of colors from index 0 up to the number of target classes

    ListedColormap(colors[:len(np.unique(y))])

  • Store result in a variable

    cmap = ListedColormap(colors[:len(np.unique(y))])

  • Define the range minimum and maximum for the first column. For doing so, find its minimum row value minus 1 and its maximum row value plus 1. Remember to put a comma between minimum and maximum
    X[:, 0].min() - 1, X[:, 0].max() + 1
  • Store the result in two new variables
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
  • Do the same for the following column and print the results
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    print(x1_min, x1_max)
    print(x2_min, x2_max)

  • Turn each min/max pair into an array of numbers from the minimum to the maximum whose entries differ by a small step (resolution) and see the output
    np.arange(x1_min, x1_max, resolution),
    np.arange(x2_min, x2_max, resolution)

  • Make a grid (a pair of 2-D coordinate arrays) by using the arrays above as arguments of np.meshgrid()
     np.meshgrid(np.arange(x1_min, x1_max, resolution),
     np.arange(x2_min, x2_max, resolution))
  • Store result in two variables and see output
     xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
     np.arange(x2_min, x2_max, resolution))
     xx1, xx2

  • Flatten each grid array into a single 1-D array using .ravel() and make an array out of the results

     np.array([xx1.ravel(), xx2.ravel()])

  • Transpose the array, so that the number of columns becomes the number of rows and vice versa

     np.array([xx1.ravel(), xx2.ravel()]).T

  • Predict an outcome for each grid point using the classifier and store the result in Z. Classifier is a generic argument of the function; we will later pass a fitted logistic regression model in its place

     Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)

  • Reshape the predictions so that they have the same shape as the grid. See the output
     Z = Z.reshape(xx1.shape)
     Z

  • Create a filled contour plot of the two grid arrays (the plot's X and Y axes) and the reshaped Z (the predicted class at each grid point). To do this write plt.contourf and use the two grid arrays and reshaped Z as arguments

    plt.contourf(xx1, xx2, Z)

  • Set the alpha (transparency) parameter to some value and the cmap (color map) parameter to our cmap variable. See the output

    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)

  • Just in case you want to be extra sure that the axes are set to the desired boundaries, write the following
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
  • Loop through all the target classes

    for cl in np.unique(y):

  • Assign an index number to each class as you loop by using enumerate

    for idx, cl in enumerate(np.unique(y)):

  • Inside the loop, create a scatter plot

        plt.scatter()

  • Select column 0 of all the samples whose target equals the current class

        X[y == cl, 0]

  • Set the previous result as the x parameter of the scatter plot

        plt.scatter(x=X[y == cl, 0])

  • Do the same for column 1, but set its result as the y parameter of the scatter plot

        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1])

  • Set the c parameter to cmap(idx), so that each class gets its own color

        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], c=cmap(idx))

  • Optionally, set the alpha (transparency) parameter to 0.8, marker to markers[idx] and label to the class value cl
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
        alpha=0.8, c=cmap(idx),
        marker=markers[idx], label=cl)

  • Check which numpy version we are using

        np.__version__

  • Split the previous value with the split decimals function (step 1)

        versiontuple(np.__version__)

  • Do the same for the string '1.9.0'

        versiontuple('1.9.0')

  • Compare both version tuples in the following way to see whether our current version is at least 1.9.0

        if not versiontuple(np.__version__) >= versiontuple('1.9.0'):

  • If it is older, give a warning and index the test samples through list(), since older numpy versions cannot index an array with a range directly

            X_test, y_test = X[list(test_idx)], y[list(test_idx)]
            warnings.warn('Please update to NumPy 1.9.0 or newer')

  • Else, index directly. Notice we call test_idx the indices of the samples we choose as test data features and target, because we will pass them into the function we are writing
        else:
            X_test, y_test = X[test_idx], y[test_idx]
  • See the output with sample data features, target, a logistic regression classifier (the lr model built in the final step below) and a test_idx covering the 105th to 149th values of the data features and target
plot_decision_regions(X, y,
                      classifier=lr, test_idx=range(105, 150))

  • Notice that if we input the standardized data features instead of the raw data features, we get the standardized version of the plot
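
For reference, here are the fragments of this step assembled into one runnable function. The if test_idx: guard and the final scatter call that outlines the test points are additions of ours (the steps above extract X_test and y_test but never show them being plotted); everything else comes straight from the fragments:

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # a grid covering the feature ranges, padded by 1 on each side
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))

    # predict the class of every grid point and draw the colored regions
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # scatter the actual samples, one marker and color per class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight the test samples, if their indices were given (our addition)
    if test_idx:
        if not versiontuple(np.__version__) >= versiontuple('1.9.0'):
            X_test, y_test = X[list(test_idx)], y[list(test_idx)]
            warnings.warn('Please update to NumPy 1.9.0 or newer')
        else:
            X_test, y_test = X[test_idx], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1], facecolors='none',
                    edgecolors='black', alpha=1.0, linewidths=1,
                    marker='o', s=55, label='test set')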

3. Sigmoid function

  • Import pyplot and numpy
import matplotlib.pyplot as plt
import numpy as np
  • Create a function that takes an argument

def sigmoid(z):

  • Return 1.0 over (1.0 plus e^-z), computed elementwise over the input array of numbers
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
  • Create the array of numbers and store it in variable

z = np.arange(-7, 7, 0.1)

  • Write function in terms of previous variable. Store function in new variable

phi_z = sigmoid(z)

  • Plot array of numbers vs function. You may need to add plt.show() at the end of every update

plt.plot(z, phi_z)

  • Plot x=0 (vertical middle line)

plt.axvline(0.0, color='k')

  • Define visual limit below y=0 and above y=1 for better visualization

plt.ylim(-0.1, 1.1)

  • Label x axis as z and y axis as phi(z)

plt.xlabel('z')

plt.ylabel('$\phi (z)$')

  • If you just want to see relevant numbers in the y-axis, choose the ticks

plt.yticks([0.0, 0.5, 1.0])

  • If you want dotted horizontal lines passing by your ticks, write the following
ax = plt.gca()
ax.yaxis.grid(True)

  • If we want to make the graph bigger and more visually proportional, write this

plt.tight_layout()

  • To save your picture, write the path where you want to save it, /, and picture_name.png. You can set the dpi (dots per inch) parameter to 300
plt.savefig('/Users/omaraguilar/DSI_SM_01/curriculum/week-05/sigmoid.png', dpi=300)
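
Before moving on, a quick sanity check that the function behaves as expected: the sigmoid maps 0 to exactly 0.5 and squashes large positive and negative inputs toward 1 and 0 respectively:

sigmoid(0)                    # 0.5
sigmoid(np.array([-7., 7.]))  # approximately [0.0009, 0.9991]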

4. Cost functions for y=1 and y=0. (These are the two terms of the logistic regression cost function, not the bias-variance tradeoff.)

  • Create a function that takes an argument

def cost_1(z):

  • Return the negative logarithm of the sigmoid function (step 3)

    return - np.log(sigmoid(z))

  • Create another function that takes an argument

def cost_0(z):

  • Return the negative logarithm of 1 minus the sigmoid function

    return - np.log(1 - sigmoid(z))
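
To see why these two costs make sense, evaluate them at a few points: an uncertain prediction (z=0, so phi(z)=0.5) gets a moderate penalty for either class, while a confidently wrong prediction is penalized heavily:

cost_1(0)    # -log(0.5), approximately 0.693
cost_0(0)    # same, approximately 0.693
cost_1(-10)  # approximately 10: confidently predicting y=0 when y=1 is costly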

5. Join the sigmoid, cost_0 and cost_1 functions

  • Define z as an array with numbers from -10 to 10 that differ by 0.1

z = np.arange(-10, 10, 0.1)

  • Store sigmoid function in a variable

phi_z = sigmoid(z)

  • Loop through the mentioned array and collect the values of cost_1(x) into a list

c1 = [cost_1(x) for x in z]

  • Plot the values of the sigmoid function vs. the previous list values

plt.plot(phi_z, c1)

  • Set label parameter to ‘J(w) if y=1’

plt.plot(phi_z, c1, label='J(w) if y=1')

  • Loop through the mentioned array and collect the values of cost_0(x) into a list

c0 = [cost_0(x) for x in z]

  • Plot the values of the sigmoid function vs. the previous list values. You may need to write plt.show() at the end of each update

plt.plot(phi_z, c0)

  • Set label parameter to ‘J(w) if y=0’

plt.plot(phi_z, c0, label='J(w) if y=0')

  • Define the vertical visual limits from y=0 to just above y=5 for better visualization

plt.ylim(0.0, 5.1)

  • Define the horizontal visual limits from x=0 to x=1 for better visualization

plt.xlim([0, 1])

  • Label x axis as phi(z) and y axis as J(w)
plt.xlabel('$\phi$(z)')
plt.ylabel('J(w)')
  • To create a legend write the following

plt.legend(loc='best')

  • If we want to make the graph bigger and more visually proportional, write this

plt.tight_layout()

  • To save your picture, write the path where you want to save it, /, and picture_name.png. You can set the dpi (dots per inch) parameter to 300
plt.savefig('/Users/omaraguilar/DSI_SM_01/curriculum/week-05/log_cost.png', dpi=300)
  • See output
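
For reference, the whole cost plot gathered from the steps above into one runnable block (the plt.show() call at the end is optional in a notebook with %matplotlib inline):

z = np.arange(-10, 10, 0.1)
phi_z = sigmoid(z)
c1 = [cost_1(x) for x in z]
plt.plot(phi_z, c1, label='J(w) if y=1')
c0 = [cost_0(x) for x in z]
plt.plot(phi_z, c0, label='J(w) if y=0')
plt.ylim(0.0, 5.1)
plt.xlim([0, 1])
plt.xlabel('$\phi$(z)')
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
plt.show()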

6. Improve the plot decision regions and data function

  • Import Logistic Regression from the scikit-learn package

from sklearn.linear_model import LogisticRegression

  • Use the scaled training and testing data features as arguments of np.vstack to create a single array with the two arrays stacked vertically inside

np.vstack((X_train_std, X_test_std))

  • Store array in a variable

X_combined_std = np.vstack((X_train_std, X_test_std))

  • Do the same for the non-scaled training and testing targets, stacking them horizontally with np.hstack

y_combined = np.hstack((y_train, y_test))

  • Set LogisticRegression's inverse regularization strength parameter (C) to a high value, which means weak regularization and a tighter fit to the training data

LogisticRegression(C=1000.0)

  • Set LogisticRegression's random_state parameter to 0, so that the results are reproducible

LogisticRegression(C=1000.0, random_state=0)

  • Store the classifier in a variable for easier manipulation

lr = LogisticRegression(C=1000.0, random_state=0)

  • Fit function to scaled training data and training target to create a model

lr.fit(X_train_std, y_train)

  • Use the two combined variables we defined at the beginning of this step as arguments of the plot decision regions function

plot_decision_regions(X_combined_std, y_combined)

  • Also set the classifier parameter to the lr variable and the test_idx parameter to the indices of the samples we chose as test data
plot_decision_regions(X_combined_std, y_combined,
                      classifier=lr, test_idx=range(105, 150))
  • Label x axis as scaled petal length and y axis as scaled petal width
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
  • To create a legend write the following

plt.legend(loc='upper left')

  • If we want to make the graph bigger and more visually proportional, write this

plt.tight_layout()

  • To save your picture, write the path where you want to save it, /, and picture_name.png. You can set the dpi (dots per inch) parameter to 300
plt.savefig('/Users/omaraguilar/DSI_SM_01/curriculum/week-05/log_cost.png', dpi=300)
  • See output
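
Finally, to connect back to the title: an actual Support Vector Machine plugs into this exact pipeline. A minimal sketch, assuming the X_train_std, y_train, combined arrays and plotting function built above (SVC is scikit-learn's support vector classifier; the linear kernel and C=1.0 here are just illustrative choices):

from sklearn.svm import SVC

svm = SVC(kernel='linear', C=1.0, random_state=0)  # linear-kernel SVM
svm.fit(X_train_std, y_train)                      # train on the scaled features

plot_decision_regions(X_combined_std, y_combined,
                      classifier=svm, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.show()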
