Getting started with machine learning
using scikit-learn

James Bourbeau

Big Data Madison Meetup

April 24, 2018

GitHub repo with materials:

https://github.com/jrbourbeau/big-data-madison-ml-sklearn

Slides:

https://jrbourbeau.github.io/big-data-madison-ml-sklearn

Contact:

E-mail: james@jamesbourbeau.com

GitHub: jrbourbeau

Twitter: __jrbourbeau__

LinkedIn: jrbourbeau

The source code for the plotting Python module used below can be found on GitHub with the rest of the materials for this talk

In [1]:
import plotting
import numpy as np
np.random.seed(2)
%matplotlib inline

Supervised machine learning workflow

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

Outline

  • What is machine learning?

    • Classical programming vs. machine learning

    • Supervised machine learning

  • scikit-learn:

    • Data representation

    • Estimator API

  • Example algorithm: decision tree classifier

  • Model validation

    • Cross validation

    • Validation curves

Machine learning vs. classical programming

Classical programming

  • Devise a set of rules (an algorithm) that are used to accomplish a task

  • For example, labeling e-mails as either "spam" or "not spam"

In [2]:
def spam_filter(email):
    """Function that labels an email as 'spam' or 'not spam'
    """
    if 'Act now!' in email.contents:
        label = 'spam'
    elif 'hotmail.com' in email.sender:
        label = 'spam'
    elif email.contents.count('$') > 20:
        label = 'spam'
    else:
        label = 'not spam'

    return label

Machine learning

  • "Field of study that gives computers the ability to learn without being explicitly programmed" — Arthur Samuel (1959)

  • "A machine-learning system is trained rather than explicitly programmed. It’s presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task." — Francois Chollet, Deep Learning with Python

Supervised machine learning

  • From a labeled dataset, an algorithm learns a mapping between input data and the desired output label

  • Goal is to have the model generalize well to future, as yet unseen, data

  • Supervised machine learning is further divided into two types of problems:

    • Classification — Labels are discrete. E.g. determine if a picture is of a cat, dog, or person.

    • Regression — Labels are continuous. E.g. predict home prices.

In [3]:
plotting.plot_classification_vs_regression()

Machine learning in Python with scikit-learn

scikit-learn

  • Popular Python machine learning library

  • Designed to be well documented and approachable for non-specialists

  • Built on top of NumPy and SciPy

  • scikit-learn can be easily installed with pip or conda

    • pip install scikit-learn

    • conda install scikit-learn

Data representation in scikit-learn

  • Training dataset is described by a pair of matrices, one for the input data and one for the output

  • Most commonly used data formats are a NumPy ndarray or a Pandas DataFrame / Series

  • Each row of these matrices corresponds to one sample of the dataset

  • Each column represents a quantitative piece of information used to describe each sample (called a "feature")
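As a small illustration of this layout, the sketch below builds a feature matrix and a label array by hand. The values and the names X_demo and y_demo are made up for this example and are not part of the talk's materials.

```python
import numpy as np

# Hypothetical toy dataset: 3 samples (rows) described by 2 features (columns)
X_demo = np.array([[5.1, 3.5],
                   [6.2, 2.9],
                   [4.7, 3.2]])
# One label per sample, in the same row order as X_demo
y_demo = np.array(['setosa', 'versicolor', 'setosa'])

n_samples, n_features = X_demo.shape
print(f'n_samples = {n_samples}, n_features = {n_features}')
print(f'y_demo.shape = {y_demo.shape}')
```

The number of rows in the input matrix must match the number of labels, since row i of the input is paired with label i.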

In [4]:
plotting.plot_data_representation()

Iris dataset

  • Dataset consists of 150 samples (individual flowers) that have 4 features: sepal length, sepal width, petal length, and petal width (all in cm)

  • Each sample is labeled by its species: Iris Setosa, Iris Versicolour, Iris Virginica

  • Task is to develop a model that predicts iris species

  • Iris dataset is freely available from the UCI Machine Learning Repository
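If a local iris.csv is not at hand, scikit-learn also bundles a copy of this dataset. A minimal sketch using sklearn.datasets.load_iris follows; the variable name iris_bunch is only illustrative, and note that the labels in this copy are integers rather than species strings.

```python
from sklearn.datasets import load_iris

# Load the copy of the iris dataset that is bundled with scikit-learn
iris_bunch = load_iris()

print(iris_bunch.data.shape)      # (n_samples, n_features)
print(iris_bunch.feature_names)   # names of the 4 feature columns
print(iris_bunch.target_names)    # species names for the integer labels
```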


Loading the iris dataset

In [5]:
import pandas as pd

iris = pd.read_csv('iris.csv')
iris = iris.sample(frac=1, random_state=2).reset_index(drop=True)
iris.head()
Out[5]:
sepal_length sepal_width petal_length petal_width species
0 4.6 3.4 1.4 0.3 setosa
1 4.6 3.1 1.5 0.2 setosa
2 5.7 2.5 5.0 2.0 virginica
3 4.8 3.0 1.4 0.1 setosa
4 4.8 3.4 1.9 0.2 setosa
In [6]:
# Only include first two training features (sepal length and sepal width)
feature_columns = ['sepal_length', 'sepal_width']
X = iris[feature_columns].values
y = iris['species'].values

print(f'First 5 samples in X: \n{X[:5]}')
print(f'First 5 labels in y: \n{y[:5]}')
First 5 samples in X: 
[[4.6 3.4]
 [4.6 3.1]
 [5.7 2.5]
 [4.8 3. ]
 [4.8 3.4]]
First 5 labels in y: 
['setosa' 'setosa' 'virginica' 'setosa' 'setosa']
In [7]:
plotting.plot_2D_iris()

Estimators in scikit-learn

  • Algorithms are implemented as estimator classes in scikit-learn

  • Each estimator in scikit-learn is extensively documented (e.g. the KNeighborsClassifier documentation) with API documentation, user guides, and example usages.

In [8]:
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.svm import SVC, SVR
from sklearn.linear_model import LinearRegression, LogisticRegression
  • A model is an instance of one of these estimator classes
In [9]:
model = KNeighborsClassifier(n_neighbors=5)
print(model)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

Estimator API


class Estimator(BaseClass):

    def __init__(self, **hyperparameters):
        # Setup Estimator here
        ...

    def fit(self, X, y):
        # Implement algorithm here
        ...
        return self

    def predict(self, X):
        # Get predicted target from trained model
        # Note: fit must be called before predict
        ...
        return y_pred


See API design for machine learning software: experiences from the scikit-learn project for a discussion of the API design choices for scikit-learn
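To make the fit / predict pattern concrete, here is a minimal sketch of a custom estimator. MajorityClassifier is a hypothetical toy class written for this example (it is not part of scikit-learn); it inherits from the real sklearn.base classes BaseEstimator and ClassifierMixin and simply predicts the most common training label.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical toy estimator: always predicts the most common training label."""

    def fit(self, X, y):
        # "Training" here just means finding the most frequent label in y
        labels, counts = np.unique(y, return_counts=True)
        self.majority_label_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        # Predict the stored majority label for every row of X
        return np.full(len(X), self.majority_label_)


# Made-up data to exercise the estimator
X_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array(['a', 'b', 'b'])

model = MajorityClassifier().fit(X_demo, y_demo)
print(model.predict(X_demo))
```

Because fit returns self, creating and training the model can be chained on one line, as in the usage above.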

Training a model — fit then predict

In [10]:
# Create the model
model = KNeighborsClassifier(n_neighbors=5)

# Fit the model
model.fit(X, y)

# Get model predictions
y_pred = model.predict(X)
y_pred[:10]
Out[10]:
array(['setosa', 'setosa', 'versicolor', 'setosa', 'setosa', 'virginica',
       'setosa', 'versicolor', 'virginica', 'setosa'], dtype=object)

Example algorithm: decision tree classifier

Decision tree classifier

Idea behind the decision tree algorithm is to sequentially partition a training dataset by asking a series of questions.

Decision tree

Image source: Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.

Node splitting to maximize purity

Decision tree
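One common purity measure is the Gini impurity, which is the default criterion for scikit-learn's DecisionTreeClassifier (criterion='gini'). The sketch below computes it for a parent node and two child nodes; the function name gini_impurity and the class counts are illustrative choices for this example.

```python
import numpy as np


def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)


# Illustrative class counts for a parent node and its two children
parent = [50, 50, 50]
left, right = [45, 6, 1], [5, 44, 49]

n_left, n_right = sum(left), sum(right)
n_parent = n_left + n_right

# Impurity of the children, weighted by the fraction of samples in each
weighted_children = ((n_left / n_parent) * gini_impurity(left)
                     + (n_right / n_parent) * gini_impurity(right))

print(f'parent impurity   = {gini_impurity(parent):.3f}')
print(f'children impurity = {weighted_children:.3f}')
print(f'impurity decrease = {gini_impurity(parent) - weighted_children:.3f}')
```

A pure node (all samples from one class) has impurity 0, so a good split is one that produces a large impurity decrease.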

Decision tree classifier in scikit-learn

In [11]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, y)
Out[11]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Visualizing decision trees — tree graph

In [12]:
plotting.plot_decision_tree(clf)
Out[12]:
Tree
Node 0: sepal length (cm) ≤ 5.45, samples = 150, value = [50, 50, 50], class = setosa
    Node 1 (0->1, True): sepal width (cm) ≤ 2.8, samples = 52, value = [45, 6, 1], class = setosa
        Node 2 (1->2): samples = 7, value = [1, 5, 1], class = versicolor
        Node 3 (1->3): samples = 45, value = [44, 1, 0], class = setosa
    Node 4 (0->4, False): sepal length (cm) ≤ 6.15, samples = 98, value = [5, 44, 49], class = virginica
        Node 5 (4->5): samples = 43, value = [5, 28, 10], class = versicolor
        Node 6 (4->6): samples = 55, value = [0, 16, 39], class = virginica

Visualizing decision trees — decision regions

In [13]:
plotting.plot_tree_decision_regions(clf)

Model validation

Model performance metrics

  • There are many different performance metrics for classification and regression problems. Which metric you should use depends on the particular problem you are working on

  • Many commonly used performance metrics are built into the metrics subpackage in scikit-learn

  • Custom user-defined scoring functions can be created using the sklearn.metrics.make_scorer function
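As a sketch of how make_scorer might be used, the example below wraps a hand-written metric so it can be passed as a scoring argument. The error_rate function and the variable names are hypothetical, and the bundled iris data from load_iris is used so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def error_rate(y_true, y_pred):
    """Hypothetical custom metric: fraction of misclassified samples."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))


# Lower error is better, so ask scikit-learn to negate the value when scoring
error_scorer = make_scorer(error_rate, greater_is_better=False)

X_all, y_all = load_iris(return_X_y=True)
clf_demo = DecisionTreeClassifier(max_depth=2, random_state=2)
scores = cross_val_score(clf_demo, X_all, y_all, scoring=error_scorer, cv=5)
print(scores)  # negated error rates, one per fold
```

With greater_is_better=False the scorer returns the negative of the metric, so that "larger is better" holds for every scorer scikit-learn compares.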

In [14]:
# Classification metrics
from sklearn.metrics import (accuracy_score, precision_score, 
                             recall_score, f1_score, log_loss)
# Regression metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
In [15]:
y_true = [0, 1, 1, 3, 2]
y_pred = [0, 2, 1, 3, 1]
In [16]:
accuracy_score(y_true, y_pred)
Out[16]:
0.6
In [17]:
mean_squared_error(y_true, y_pred)
Out[17]:
0.4
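To see where these two numbers come from, the same quantities can be worked out directly with NumPy. This is only a sketch of the definitions, not how scikit-learn implements them internally.

```python
import numpy as np

y_true = np.array([0, 1, 1, 3, 2])
y_pred = np.array([0, 2, 1, 3, 1])

# Accuracy: fraction of predictions that exactly match the true label
accuracy = np.mean(y_true == y_pred)

# Mean squared error: average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)

print(accuracy)  # 3 of the 5 predictions are correct
print(mse)       # squared errors are [0, 1, 0, 0, 1]
```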

Separate training & testing sets

  • A trained model will generally perform better on data that was used to train it

  • Want to measure how well a model generalizes to new, unseen data

  • Need to have two separate datasets. One for training models and one for evaluating model performance

  • scikit-learn has a convenient train_test_split function that randomly splits a dataset into training and testing sets

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=2)

print(f'X.shape = {X.shape}')
print(f'X_test.shape = {X_test.shape}')
print(f'X_train.shape = {X_train.shape}')
X.shape = (150, 2)
X_test.shape = (30, 2)
X_train.shape = (120, 2)
In [19]:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(f'training accuracy = {accuracy_score(y_train, clf.predict(X_train))}')
print(f'testing accuracy = {accuracy_score(y_test, clf.predict(X_test))}')
training accuracy = 0.9333333333333333
testing accuracy = 0.7
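A purely random split can leave the classes unevenly represented in the two subsets. One option, sketched below, is the stratify argument of train_test_split, which keeps the class proportions the same in both subsets. The bundled iris data and the variable names here are assumptions made so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# stratify=y_all keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2,
                                          stratify=y_all, random_state=2)

print(np.bincount(y_tr))  # samples per class in the training set
print(np.bincount(y_te))  # samples per class in the testing set
```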

Model selection — hyperparameter optimization

  • Choose model hyperparameter values to avoid under- and over-fitting

  • Under-fitting — model isn't complex enough to properly model the dataset at hand

  • Over-fitting — model is too complex and begins to learn the noise in the training dataset
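One way to see both effects is to compare training and testing accuracy for trees of different depth. The sketch below does this for a few max_depth values; the choice of depths, the bundled iris data, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)
X_all = X_all[:, :2]  # sepal length and sepal width only
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2,
                                          random_state=2)

results = {}
for depth in [1, 3, None]:  # None lets the tree grow until it cannot split further
    tree = DecisionTreeClassifier(max_depth=depth, random_state=2)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f'max_depth={depth}: train={results[depth][0]:.2f}, '
          f'test={results[depth][1]:.2f}')
```

A large gap between training and testing accuracy suggests over-fitting, while low accuracy on both suggests under-fitting.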

overview

Image source: Underfitting vs. Overfitting in scikit-learn examples

$k$-fold cross validation diagram


Image source: Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
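The procedure in the diagram can be sketched by hand with scikit-learn's KFold splitter: the data is divided into k folds, and each fold takes one turn as the held-out validation set. The value k = 5, the bundled iris data, and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)

kfold = KFold(n_splits=5, shuffle=True, random_state=2)
fold_scores = []
for train_idx, val_idx in kfold.split(X_all):
    # Train on k-1 folds, evaluate on the single held-out fold
    fold_clf = DecisionTreeClassifier(max_depth=2, random_state=2)
    fold_clf.fit(X_all[train_idx], y_all[train_idx])
    fold_scores.append(fold_clf.score(X_all[val_idx], y_all[val_idx]))

print(f'fold scores = {np.round(fold_scores, 3)}')
print(f'mean = {np.mean(fold_scores):.3f}, std = {np.std(fold_scores):.3f}')
```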

Cross validation in scikit-learn

In [20]:
from sklearn.model_selection import cross_validate

clf = DecisionTreeClassifier(max_depth=2)
scores = cross_validate(clf, X_train, y_train,
                        scoring='accuracy', cv=10,
                        return_train_score=True)

print(scores.keys())
test_scores = scores['test_score']
train_scores = scores['train_score']
print(test_scores)
print(train_scores)

print('\n10-fold CV scores:')
print(f'training score = {np.mean(train_scores)} +/- {np.std(train_scores)}')
print(f'validation score = {np.mean(test_scores)} +/- {np.std(test_scores)}')
dict_keys(['fit_time', 'score_time', 'test_score', 'train_score'])
[0.84615385 0.76923077 0.75       0.58333333 0.91666667 0.66666667
 0.91666667 0.83333333 0.63636364 0.72727273]
[0.76635514 0.77570093 0.77777778 0.7962963  0.75925926 0.78703704
 0.75925926 0.76851852 0.78899083 0.74311927]

10-fold CV scores:
training score = 0.7722314314657621 +/- 0.015344020267747309
validation score = 0.7645687645687647 +/- 0.10869446623132276

Validation curves

Validation curves are a good way to diagnose if a model is under- or over-fitting
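The numbers behind such a curve can be computed with the validation_curve function in sklearn.model_selection. The sketch below scans max_depth for a decision tree; the depth range, the bundled iris data, and the variable names are assumptions, and the plotting itself is left out.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)
depths = list(range(1, 11))

# Cross-validated scores for each candidate max_depth value
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=2), X_all, y_all,
    param_name='max_depth', param_range=depths,
    scoring='accuracy', cv=5)

# Rows correspond to max_depth values, columns to CV folds
for depth, tr, va in zip(depths, train_scores.mean(axis=1),
                         val_scores.mean(axis=1)):
    print(f'max_depth={depth}: train={tr:.3f}, validation={va:.3f}')
```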

In [21]:
plotting.plot_validation_curve()
In [22]:
plotting.plot_max_depth_validation(clf, X_train, y_train)

Hyperparameter tuning via GridSearchCV

  • In practice, you'll want to optimize many different hyperparameter values simultaneously

  • The GridSearchCV object in scikit-learn's model_selection subpackage can be used to scan over many different hyperparameter combinations

  • Calculates cross-validated training and testing scores for each hyperparameter combination

  • The combination that maximizes the mean cross-validated testing score is deemed to be the "best estimator"

In [23]:
from sklearn.model_selection import GridSearchCV

# Instantiate a model
clf = DecisionTreeClassifier()

# Specify hyperparameter values to test
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}

# Run grid search
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)

# Get best model
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')
print(gridsearch.best_estimator_)
gridsearch.best_params_ = {'criterion': 'gini', 'max_depth': 3}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
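Beyond the single best combination, a fitted GridSearchCV object stores the scores for every combination in its cv_results_ attribute. A minimal sketch of inspecting it as a DataFrame follows; the smaller grid, the bundled iris data, and the variable names are assumptions so the block runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)

search = GridSearchCV(DecisionTreeClassifier(random_state=2),
                      {'max_depth': [1, 2, 3, 4, 5]},
                      scoring='accuracy', cv=5)
search.fit(X_all, y_all)

# cv_results_ is a dict of arrays with one entry per hyperparameter combination
results = pd.DataFrame(search.cv_results_)
print(results[['param_max_depth', 'mean_test_score', 'std_test_score',
               'rank_test_score']])
print(f'best_score_ = {search.best_score_:.3f}')
```

Looking at the full table is useful for checking whether several combinations perform about equally well within the fold-to-fold spread.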

Supervised machine learning workflow

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

Step 1 — Separate training and testing datasets

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=2)

Steps 2 & 3 — Optimize hyperparameters via cross validation

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

In [25]:
clf = DecisionTreeClassifier()
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')

best_clf = gridsearch.best_estimator_
best_clf
gridsearch.best_params_ = {'criterion': 'gini', 'max_depth': 3}
Out[25]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Step 4 — Model performance

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

In [26]:
y_pred = best_clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f'test_acc = {test_acc}')
test_acc = 0.8222222222222222

Step 5 — Train final model on full dataset

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

In [27]:
final_model = DecisionTreeClassifier(**gridsearch.best_params_)
final_model.fit(X, y)
Out[27]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Iris classification problem

In [28]:
# Step 1: Get training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=2)

# Step 2: Use GridSearchCV to find optimal hyperparameter values
clf = DecisionTreeClassifier(random_state=2)
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')

# Step 3: Get model with best hyperparameters
best_clf = gridsearch.best_estimator_

# Step 4: Get best model performance from testing set
y_pred = best_clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f'test_acc = {test_acc}')

# Step 5: Train final model on full dataset
final_model = DecisionTreeClassifier(random_state=2, **gridsearch.best_params_)
final_model.fit(X, y);
gridsearch.best_params_ = {'criterion': 'gini', 'max_depth': 3}
test_acc = 0.8222222222222222
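Once a final model is trained, it can be used to label new samples. The sketch below is a stand-in that rebuilds a model with the hyperparameters found above from the bundled iris data; demo_model, new_flowers, and the measurement values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_bunch = load_iris()
X_all = iris_bunch.data[:, :2]                       # sepal length, sepal width
y_all = iris_bunch.target_names[iris_bunch.target]   # species as strings

# Stand-in for the final model, using the hyperparameters found above
demo_model = DecisionTreeClassifier(criterion='gini', max_depth=3,
                                    random_state=2)
demo_model.fit(X_all, y_all)

# Hypothetical measurements (in cm) for two flowers not in the dataset
new_flowers = np.array([[5.0, 3.4],
                        [6.7, 3.0]])
preds = demo_model.predict(new_flowers)
probs = demo_model.predict_proba(new_flowers)
print(preds)
print(probs.round(2))  # columns follow the order of demo_model.classes_
```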

Additional Resources

  • Python Machine Learning by Sebastian Raschka [GitHub][Amazon]

  • Data Science Handbook by Jake VanderPlas [GitHub][Amazon]

  • The Elements of Statistical Learning by Hastie, Tibshirani and Friedman [Free book!]

  • Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville [Amazon]

Thank you

Any questions?