Lab Session 2 - Supervised Learning

Objectives of the lab

At the end of this session, you will be able to:

  • Perform basic supervised learning tasks using sklearn.
  • Apply supervised learning to your chosen application.

Note

Whenever possible, we have included links to the official documentation of the functions you need to use. Check them out, you'll save a lot of time!

Prerequisites

You have to complete lab 1 first. This includes:

  • Knowing how to manipulate numpy arrays, in particular indexing an array using another array
  • Knowing how to do basic plotting of vectors using matplotlib.pyplot (1D plots, scatter plots, ...)
  • Knowing how to load / save numpy arrays

Intro to sklearn

Install scikit-learn using pip. If you need details, see here, but a simple pip install scikit-learn should work.

Note

The package is imported using import sklearn, but the actual package name for installation is scikit-learn.

Basics of machine learning using sklearn

sklearn is a very powerful package that implements most classical machine learning methods. It also includes cross-validation procedures to prevent overfitting, many useful metrics, and data manipulation utilities that enable very careful experimentation with machine learning. It is also very straightforward to use. We will introduce a few basic concepts of sklearn.

First, it is very easy to simulate data with sklearn. In lab 1, we made you code the generation of two point clouds using numpy functions. The make_blobs function does this for you, with many added options.

from sklearn.datasets import make_blobs

Use the function make_blobs to generate clouds of points with \(d=2\), and visualize them using the function scatter from matplotlib.pyplot. You can generate as many samples as you want. You can generate several clouds of points using the argument centers. We recommend using random_state=0 so that your results are from the same distribution as our tests.

Vocabulary: n_samples is the number of generated samples, n_features is \(d\) (the number of dimensions), and centers is the number of classes.

Hint: you can use the output y as the color argument (c) of the scatter function.

###  - Generate blobs using the make_blobs function and plot the result using the scatter function with colors.
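
For reference, here is a minimal sketch of one possible solution (the values of n_samples and centers are arbitrary choices, not requirements):

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate 2D points (n_features=2) grouped into 3 classes
x, y = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)

# Color each point according to its class label y
plt.scatter(x[:, 0], x[:, 1], c=y)
plt.show()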

You can use other arguments of make_blobs to change the variance of the blobs or the coordinates of their centers. You can also experiment with higher dimensions, although they become difficult to visualize.

sklearn has many other data generators, as well as ways to load standard datasets of various sizes.

Train / Test Split

Now that we have generated a simple dataset, let's try to do a basic supervised learning approach.

First, in order to measure the ability of the model to generalize, we have to split the dataset into a training set and a test set. The test set is the part of the dataset that the model will not see during training; it will be used as a proxy for your "real world" examples.

[Figure: the train / test split workflow. Image taken from https://towardsdatascience.com/machine-learning-workflow-on-diabetes-data-part-01-573864fcc6b8]

In sklearn, you can use the train_test_split function to split datasets.

Try to split the dataset you previously generated (the blobs) into x_train, x_test, y_train, y_test, with 80% of the samples in x_train and 20% in x_test. Set random_state=0 so that the function always returns the same split.

Check the shapes of the generated vectors.

x_train.shape, x_test.shape, y_train.shape, y_test.shape
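
A minimal sketch, assuming the blobs from the previous step are stored in x and y (the 80/20 split is expressed through test_size=0.2):

from sklearn.model_selection import train_test_split

# Hold out 20% of the samples as a test set, reproducibly
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# The first dimension of each array should reflect the 80/20 split
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)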

K-Nearest Neighbor

Let's use a K-Nearest Neighbor classifier to test whether we can classify this dataset. Create a classifier, train it on your training set, and evaluate its accuracy on both the train and test sets.

In K-Nearest Neighbor classification (also known as KNN), to predict the class of an object, you look at the K (a hyperparameter) nearest examples from the training set (using a distance metric, in our case the Euclidean distance). The object is then classified by a majority vote among these neighbors: its predicted class is the most common class among them.

To use a K-Nearest Neighbor classifier with sklearn, use the class KNeighborsClassifier.

The sklearn API is consistent: for almost every method it provides, you can train a model using object.fit, make predictions with object.predict, and compute the accuracy of the method using object.score.

###  - Train the classifier and get the accuracy on both sets.
from sklearn.neighbors import KNeighborsClassifier
k = 1
classifier = KNeighborsClassifier(n_neighbors=k, n_jobs=1)
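
Following the fit / predict / score pattern described above, a minimal sketch:

# Train on the training set only
classifier.fit(x_train, y_train)

# score returns the mean accuracy on the given set
train_acc = classifier.score(x_train, y_train)
test_acc = classifier.score(x_test, y_test)
print(train_acc, test_acc)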

Your classifier should have a train accuracy of 1, while the test accuracy should be high but not perfect.

This illustrates the bias-variance trade-off. The 1-NN classifier always has a bias of 0 (it perfectly classifies the training set), but it has a high variance, since adding a single example to the training set can completely change a decision.

To avoid such a high variance, test different values of k and plot the train and test accuracies as a function of the hyperparameter k.

If you have time, we advise you to repeat the same analysis while varying the train/test split size.

#  - Train classifiers with varying k.
train_acc_p = list()  # list storing the train set accuracies
test_acc_p = list()  # list storing the test set accuracies
test_ks = range(1, 25)  # values of k to be tested
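
One possible way to fill in the loop (a sketch, assuming KNeighborsClassifier and matplotlib.pyplot as plt are already imported):

# Train one KNN classifier per value of k and record both accuracies
for k in test_ks:
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(x_train, y_train)
    train_acc_p.append(classifier.score(x_train, y_train))
    test_acc_p.append(classifier.score(x_test, y_test))

# Plot both accuracy curves as a function of k
plt.plot(test_ks, train_acc_p, label='train')
plt.plot(test_ks, test_acc_p, label='test')
plt.xlabel('k')
plt.ylabel('accuracy')
plt.legend()
plt.show()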

Precision, Recall and F1-score

Once your classifier is trained and the bias-variance trade-off analysed, it is time to look at other metrics based on your results. Accuracy is a key metric, but it should not be the only metric you focus on.

Print a classification report and a confusion matrix for both training and test sets.

In the classification report, you will see 3 new metrics. They are really important because accuracy alone does not paint a complete picture of your results.

  • Precision: percentage of correctly classified examples with respect to all examples predicted to belong to a given class
  • Recall: percentage of correctly classified examples with respect to all examples actually belonging to a given class
  • F1 Score: harmonic mean of precision and recall, \(F_1 = 2\,\frac{\text{precision}\times\text{recall}}{\text{precision}+\text{recall}}\)
###  - Generate the report and confusion matrix for the test set.
from sklearn.metrics import classification_report, confusion_matrix
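
A minimal sketch for the test set (the same two calls apply to the training set):

# Compare the true labels with the model's predictions
y_pred = classifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))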

Decision Boundaries

Finally, you are going to plot the decision boundaries of your model, using the function plot_boundaries given below. You can only do this if your data is two-dimensional (\(d=2\)).

This function evaluates the model on a grid of values ranging from the smallest to the largest x and from the smallest to the largest y, with a step of \(h\), and plots the result nicely. Link to the original implementation.

import numpy as np
import matplotlib.pyplot as plt

def plot_boundaries(classifier, X, Y, h=0.2):
    # Build a grid covering the data, with a margin of 1 on each side
    x0_min, x0_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x1_min, x1_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    x0, x1 = np.meshgrid(np.arange(x0_min, x0_max, h),
                         np.arange(x1_min, x1_max, h))

    # Predict the class of every point on the grid
    dataset = np.c_[x0.ravel(), x1.ravel()]
    Z = classifier.predict(dataset)

    # Put the result into a color plot
    Z = Z.reshape(x0.shape)
    plt.figure()
    plt.pcolormesh(x0, x1, Z)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=Y,
                edgecolor='k', s=20)
    plt.xlim(x0.min(), x0.max())
    plt.ylim(x1.min(), x1.max())

plot_boundaries(classifier, x_train, y_train)

Application

There are three things to do:

  1. Choose a supervised learning algorithm from the following list:
    • Decision Trees - sklearn.tree.DecisionTreeClassifier
    • Support Vector Machines - sklearn.svm.SVC
    • Logistic Regression - sklearn.linear_model.LogisticRegression
    • Adaboost - sklearn.ensemble.AdaBoostClassifier
    • Multilayer Perceptron (MLP) - sklearn.neural_network.MLPClassifier
    • Random Forest - sklearn.ensemble.RandomForestClassifier
  2. Use synthetic data generation, or a simple "toy" dataset, to explain how the algorithm works and the influence of its hyperparameters (a starting point is sketched after this list).
  3. Apply this algorithm to your chosen application, referring to the application pages (section "Resources for Session 2").
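
Thanks to the consistent sklearn API, any classifier above can be swapped into the workflow from the previous sections. A sketch using DecisionTreeClassifier (the classifier and the max_depth value are illustrative choices, not requirements):

from sklearn.tree import DecisionTreeClassifier

# Every listed classifier follows the same fit / predict / score pattern
classifier = DecisionTreeClassifier(max_depth=3, random_state=0)
classifier.fit(x_train, y_train)
print(classifier.score(x_test, y_test))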

In Session 3, you will present these three things in a 7-minute presentation.