Getting started with machine learning
using scikit-learn

James Bourbeau¶

Big Data Madison Meetup¶

April 24, 2018

GitHub repo with materials:¶

https://github.com/jrbourbeau/big-data-madison-ml-sklearn

Slides:¶

https://jrbourbeau.github.io/big-data-madison-ml-sklearn

Contact:¶

E-mail: james@jamesbourbeau.com

GitHub: jrbourbeau

Twitter: __jrbourbeau__

LinkedIn: jrbourbeau

Source code for the plotting Python module can be found on GitHub with the rest of the materials for this talk

In [1]:

import plotting
import numpy as np
np.random.seed(2)
%matplotlib inline

Supervised machine learning workflow¶

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

Outline¶

What is machine learning?
- Classical programming vs. machine learning
- Supervised machine learning
scikit-learn:
- Data representation
- Estimator API
Example algorithm: decision tree classifier
Model validation
- Cross validation
- Validation curves

Machine learning vs. classical programming¶

Classical programming¶

Devise a set of rules (an algorithm) that are used to accomplish a task
For example, labeling e-mails as either "spam" or "not spam"

In [2]:

def spam_filter(email):
    """Function that labels an email as 'spam' or 'not spam'
    """
    if 'Act now!' in email.contents:
        label = 'spam'
    elif 'hotmail.com' in email.sender:
        label = 'spam'
    elif email.contents.count('$') > 20:
        label = 'spam'
    else:
        label = 'not spam'

    return label

Machine learning¶

"Field of study that gives computers the ability to learn without being explicitly programmed" — Arthur Samuel (1959)
"A machine-learning system is trained rather than explicitly programmed. It’s presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task." — Francois Chollet, Deep Learning with Python

Supervised machine learning¶

From a labeled dataset, an algorithm learns a mapping between input data and the desired output label
Goal is to have model generalize well to future, yet unseen, data
Supervised machine learning is further divided into two types of problems:
- Classification — Labels are discrete. E.g. determine if a picture is of a cat, dog, or person.
- Regression — Labels are continuous. E.g. predict home prices.
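To make the distinction concrete, here is a minimal sketch on made-up toy data (the values and variable names are illustrative assumptions, not part of the talk materials). The same input feature is paired once with discrete labels and once with continuous labels.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy inputs: one feature per sample (hypothetical values)
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Classification: labels are discrete categories
y_class = np.array(['small', 'small', 'small', 'large', 'large', 'large'])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
class_pred = clf.predict([[2.5]])

# Regression: labels are continuous numbers
y_reg = np.array([1.1, 1.9, 3.2, 9.8, 11.1, 12.3])
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
reg_pred = reg.predict([[2.5]])

print(class_pred)  # one of the discrete labels
print(reg_pred)    # a real number (mean of the 3 nearest targets)
```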

In [3]:

plotting.plot_classification_vs_regression()

Machine learning in Python with scikit-learn¶

scikit-learn¶

Popular Python machine learning library
Designed to be well documented and approachable for non-specialists
Built on top of NumPy and SciPy
scikit-learn can be easily installed with pip or conda
- pip install scikit-learn
- conda install scikit-learn

Data representation in scikit-learn¶

Training dataset is described by a pair of matrices, one for the input data and one for the output
Most commonly used data formats are a NumPy ndarray or a Pandas DataFrame / Series

Each row of these matrices corresponds to one sample of the dataset
Each column represents a quantitative piece of information that is used to describe each sample (called "features")
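As a small sketch of this layout (the numbers below are invented for illustration), the input is a 2D array of shape (n_samples, n_features) and the output is a 1D array with one label per sample:

```python
import numpy as np

# Hypothetical dataset: 4 samples (rows), 3 features (columns)
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 2.9, 4.3],
              [7.7, 3.8, 6.7]])

# One output label per sample, in the same row order as X
y = np.array([0, 0, 1, 2])

n_samples, n_features = X.shape
print(f'n_samples = {n_samples}, n_features = {n_features}')
print(f'y.shape = {y.shape}')

# The number of rows in X must match the number of labels in y
assert X.shape[0] == y.shape[0]
```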

In [4]:

plotting.plot_data_representation()

Iris dataset¶

Dataset consists of 150 samples (individual flowers) that have 4 features: sepal length, sepal width, petal length, and petal width (all in cm)
Each sample is labeled by its species: Iris Setosa, Iris Versicolour, Iris Virginica
Task is to develop a model that predicts iris species
Iris dataset is freely available from the UCI Machine Learning Repository

Iris dataset

Loading the iris dataset¶

In [5]:

import pandas as pd

iris = pd.read_csv('iris.csv')
iris = iris.sample(frac=1, random_state=2).reset_index(drop=True)
iris.head()

Out[5]:

	sepal_length	sepal_width	petal_length	petal_width	species
0	4.6	3.4	1.4	0.3	setosa
1	4.6	3.1	1.5	0.2	setosa
2	5.7	2.5	5.0	2.0	virginica
3	4.8	3.0	1.4	0.1	setosa
4	4.8	3.4	1.9	0.2	setosa

In [6]:

# Only include first two training features (sepal length and sepal width)
feature_columns = ['sepal_length', 'sepal_width']
X = iris[feature_columns].values
y = iris['species'].values

print(f'First 5 samples in X: \n{X[:5]}')
print(f'First 5 labels in y: \n{y[:5]}')

First 5 samples in X: 
[[4.6 3.4]
 [4.6 3.1]
 [5.7 2.5]
 [4.8 3. ]
 [4.8 3.4]]
First 5 labels in y: 
['setosa' 'setosa' 'virginica' 'setosa' 'setosa']

In [7]:

plotting.plot_2D_iris()

Estimators in scikit-learn¶

Algorithms are implemented as estimator classes in scikit-learn
Each estimator in scikit-learn is extensively documented (e.g. the KNeighborsClassifier documentation) with API documentation, user guides, and example usages.

In [8]:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.svm import SVC, SVR
from sklearn.linear_model import LinearRegression, LogisticRegression

A model is an instance of one of these estimator classes

In [9]:

model = KNeighborsClassifier(n_neighbors=5)
print(model)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

Estimator API¶

class Estimator(BaseClass):

    def __init__(self, **hyperparameters):
        # Setup Estimator here

    def fit(self, X, y):
        # Implement algorithm here

        return self

    def predict(self, X):
        # Get predicted target from trained model
        # Note: fit must be called before predict

        return y_pred

See API design for machine learning software: experiences from the scikit-learn project for a discussion of the API design choices for scikit-learn
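The skeleton above is schematic. As a runnable sketch of the same fit / predict pattern, here is a toy estimator that always predicts the most common training label. The class name MajorityClassifier and the toy data are assumptions made for illustration; BaseEstimator and ClassifierMixin are the scikit-learn base classes typically used for custom classifiers.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator that always predicts the most common training label."""

    def __init__(self):
        # No hyperparameters are needed for this toy example
        pass

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        # Attributes learned during fit end with an underscore by convention
        self.majority_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        # One prediction per row of X; fit must be called first
        return np.full(len(X), self.majority_)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(['a', 'b', 'b', 'b'])

model = MajorityClassifier().fit(X, y)
y_pred = model.predict(X)
print(y_pred)
```

Because fit returns self, construction, fitting, and prediction can be chained, as in the last lines of the sketch.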

Training a model — fit then predict

In [10]:

# Create the model
model = KNeighborsClassifier(n_neighbors=5)

# Fit the model
model.fit(X, y)

# Get model predictions
y_pred = model.predict(X)
y_pred[:10]

Out[10]:

array(['setosa', 'setosa', 'versicolor', 'setosa', 'setosa', 'virginica',
       'setosa', 'versicolor', 'virginica', 'setosa'], dtype=object)

Example algorithm: decision tree classifier¶

Decision tree classifier¶

Idea behind the decision tree algorithm is to sequentially partition a training dataset by asking a series of questions.

Decision tree

Image source: Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.

Node splitting to maximize purity¶

Decision tree
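One common way to quantify purity is the Gini impurity, which corresponds to the criterion='gini' default shown in the DecisionTreeClassifier output in this talk. The following is a simplified sketch in plain Python, not scikit-learn's implementation; the helper names and toy labels are assumptions for illustration. A split is scored by the size-weighted impurity of its two child nodes, and lower is better.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k**2 (0 means the node is perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / n


parent = ['setosa'] * 4 + ['virginica'] * 4

# Candidate split A separates the classes perfectly; split B does not
split_a = (['setosa'] * 4, ['virginica'] * 4)
split_b = (['setosa'] * 3 + ['virginica'], ['setosa'] + ['virginica'] * 3)

print(gini_impurity(parent))     # 0.5
print(split_impurity(*split_a))  # 0.0
print(split_impurity(*split_b))  # 0.375
```

Split A would be preferred because it reduces the impurity from 0.5 to 0.0.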

Decision tree classifier in scikit-learn¶

In [11]:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, y)

Out[11]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Visualizing decision trees — tree graph

In [12]:

plotting.plot_decision_tree(clf)

Out[12]:

Visualizing decision trees — decision regions

In [13]:

plotting.plot_tree_decision_regions(clf)

Model validation¶

Model performance metrics¶

There are many different performance metrics for classification and regression problems. Which metric you should use depends on the particular problem you are working on
Many commonly used performance metrics are built into the metrics subpackage in scikit-learn
Custom user-defined scoring functions can be created using the sklearn.metrics.make_scorer function
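A minimal sketch of a custom scorer, assuming a hypothetical error_rate metric (the fraction of misclassified samples) and toy data invented for illustration:

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeClassifier


def error_rate(y_true, y_pred):
    """Fraction of misclassified samples (lower is better)."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))


# greater_is_better=False makes scikit-learn negate the metric, so that
# "a higher score is better" still holds during model selection
error_scorer = make_scorer(error_rate, greater_is_better=False)

# Toy data that a decision tree can separate perfectly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# A scorer is called with the fitted estimator, the inputs, and the true labels
score = error_scorer(clf, X, y)
print(score)
```

The resulting scorer object can be passed as the scoring argument of tools such as cross_validate or GridSearchCV.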

In [14]:

# Classification metrics
from sklearn.metrics import (accuracy_score, precision_score, 
                             recall_score, f1_score, log_loss)
# Regression metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [15]:

y_true = [0, 1, 1, 3, 2]
y_pred = [0, 2, 1, 3, 1]

In [16]:

accuracy_score(y_true, y_pred)

Out[16]:

0.6

In [17]:

mean_squared_error(y_true, y_pred)

Out[17]:

0.4

Separate training & testing sets¶

A trained model will generally perform better on data that was used to train it
Want to measure how well a model generalizes to new, unseen data
Need to have two separate datasets. One for training models and one for evaluating model performance
scikit-learn has a convenient train_test_split function that randomly splits a dataset into a testing and training set

In [18]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=2)

print(f'X.shape = {X.shape}')
print(f'X_test.shape = {X_test.shape}')
print(f'X_train.shape = {X_train.shape}')

X.shape = (150, 2)
X_test.shape = (30, 2)
X_train.shape = (120, 2)

In [19]:

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(f'training accuracy = {accuracy_score(y_train, clf.predict(X_train))}')
print(f'testing accuracy = {accuracy_score(y_test, clf.predict(X_test))}')

training accuracy = 0.9333333333333333
testing accuracy = 0.7

Model selection — hyperparameter optimization

Choose model hyperparameter values to avoid under- and over-fitting
Under-fitting: model isn't complex enough to properly model the dataset at hand
Over-fitting: model is too complex and begins to learn the noise in the training dataset

overview

Image source: Underfitting vs. Overfitting in scikit-learn examples

$k$-fold cross validation diagram¶

k-fold cross validation

Image source: Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
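The procedure in the diagram can be sketched by hand with scikit-learn's KFold splitter. This is an illustrative sketch on synthetic data (the dataset, the choice of 5 folds, and the variable names are assumptions): the data is split into k folds, and each fold is used once for validation while the remaining folds are used for training.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature, two-class dataset (hypothetical)
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kfold.split(X):
    # Train a fresh model on k - 1 folds...
    clf = DecisionTreeClassifier(max_depth=2, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    # ...and evaluate its accuracy on the held-out fold
    fold_scores.append(clf.score(X[val_idx], y[val_idx]))

print(f'mean = {np.mean(fold_scores):.3f}, std = {np.std(fold_scores):.3f}')
```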

Cross validation in scikit-learn¶

In [20]:

from sklearn.model_selection import cross_validate

clf = DecisionTreeClassifier(max_depth=2)
scores = cross_validate(clf, X_train, y_train,
                        scoring='accuracy', cv=10,
                        return_train_score=True)

print(scores.keys())
test_scores = scores['test_score']
train_scores = scores['train_score']
print(test_scores)
print(train_scores)

print('\n10-fold CV scores:')
print(f'training score = {np.mean(train_scores)} +/- {np.std(train_scores)}')
print(f'validation score = {np.mean(test_scores)} +/- {np.std(test_scores)}')

dict_keys(['fit_time', 'score_time', 'test_score', 'train_score'])
[0.84615385 0.76923077 0.75       0.58333333 0.91666667 0.66666667
 0.91666667 0.83333333 0.63636364 0.72727273]
[0.76635514 0.77570093 0.77777778 0.7962963  0.75925926 0.78703704
 0.75925926 0.76851852 0.78899083 0.74311927]

10-fold CV scores:
training score = 0.7722314314657621 +/- 0.015344020267747309
validation score = 0.7645687645687647 +/- 0.10869446623132276

Validation curves¶

Validation curves are a good way to diagnose if a model is under- or over-fitting
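The scores behind such a curve can be computed with the validation_curve function in sklearn.model_selection. The sketch below uses synthetic noisy data and a max_depth scan chosen for illustration; it is not the code behind the plotting module used in this talk.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy two-class dataset (hypothetical)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

depths = np.arange(1, 11)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name='max_depth', param_range=depths,
    scoring='accuracy', cv=5)

# Each result has shape (n_parameter_values, n_folds); average over folds
for depth, train, val in zip(depths,
                             train_scores.mean(axis=1),
                             val_scores.mean(axis=1)):
    print(f'max_depth = {depth:2d}: training = {train:.3f}, validation = {val:.3f}')
```

A training score that keeps rising while the validation score levels off or falls suggests over-fitting; low scores on both suggest under-fitting.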

In [21]:

plotting.plot_validation_curve()

In [22]:

plotting.plot_max_depth_validation(clf, X_train, y_train)

Hyperparameter tuning via GridSearchCV¶

In practice, you'll want to optimize many different hyperparameter values simultaneously
The GridSearchCV object in scikit-learn's model_selection subpackage can be used to scan over many different hyperparameter combinations
Calculates cross-validated training and testing scores for each hyperparameter combination
The combination that maximizes the mean cross-validated testing score is deemed to be the "best estimator"
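To get a feel for the size of such a scan, the combinations can be enumerated with ParameterGrid from sklearn.model_selection. This is a small sketch using a grid of max_depth and criterion values; the fold count of 10 is an assumption matching the cv=10 setting used elsewhere in this talk.

```python
from sklearn.model_selection import ParameterGrid

# Hyperparameter values to scan over
parameters = {'max_depth': list(range(1, 20)),
              'criterion': ['gini', 'entropy']}

grid = ParameterGrid(parameters)
n_combinations = len(grid)
n_folds = 10  # assumed number of cross validation folds

print(n_combinations)            # 38 (19 depths x 2 criteria)
print(n_combinations * n_folds)  # 380 model fits during the search

# Each grid point is a dict of keyword arguments for the estimator
print(list(grid)[0])
```

With the default refit=True, GridSearchCV performs one additional fit of the best combination on the full data it was given.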

In [23]:

from sklearn.model_selection import GridSearchCV

# Instantiate a model
clf = DecisionTreeClassifier()

# Specify hyperparameter values to test
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}

# Run grid search
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)

# Get best model
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')
print(gridsearch.best_estimator_)

gridsearch.best_params_ = {'criterion': 'gini', 'max_depth': 3}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Supervised machine learning workflow¶

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

Step 1 — Separate training and testing datasets

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

In [24]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=2)

Steps 2 & 3 — Optimize hyperparameters via cross validation

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

In [25]:

clf = DecisionTreeClassifier()
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')

best_clf = gridsearch.best_estimator_
best_clf

gridsearch.best_params_ = {'criterion': 'gini', 'max_depth': 3}

Out[25]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Step 4: Model performance

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

In [26]:

y_pred = best_clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f'test_acc = {test_acc}')

test_acc = 0.8222222222222222

Step 5: Train final model on full dataset

overview

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

In [27]:

final_model = DecisionTreeClassifier(**gridsearch.best_params_)
final_model.fit(X, y)

Out[27]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Iris classification problem¶

In [28]:

# Step 1: Get training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=2)

# Step 2: Use GridSearchCV to find optimal hyperparameter values
clf = DecisionTreeClassifier(random_state=2)
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')

# Step 3: Get model with best hyperparameters
best_clf = gridsearch.best_estimator_

# Step 4: Get best model performance from testing set
y_pred = best_clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f'test_acc = {test_acc}')

# Step 5: Train final model on full dataset
final_model = DecisionTreeClassifier(random_state=2, **gridsearch.best_params_)
final_model.fit(X, y);

gridsearch.best_params_ = {'criterion': 'gini', 'max_depth': 3}
test_acc = 0.8222222222222222

Additional Resources¶

Python Machine Learning by Sebastian Raschka [GitHub][Amazon]
Data Science Handbook by Jake VanderPlas [GitHub][Amazon]
The Elements of Statistical Learning by Hastie, Tibshirani and Friedman [Free book!]
Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville [Amazon]

Thank you¶

Any questions?¶
