https://github.com/jrbourbeau/big-data-madison-ml-sklearn
https://jrbourbeau.github.io/big-data-madison-ml-sklearn
E-mail: james@jamesbourbeau.com
GitHub: jrbourbeau
Twitter: __jrbourbeau__
LinkedIn: jrbourbeau
Source code for plotting
Python module can be found on GitHub with the rest of the materials for this talk
import plotting
import numpy as np
np.random.seed(2)
%matplotlib inline
Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka
What is machine learning?
Classical programming vs. machine learning
Supervised machine learning
scikit-learn:
Data representation
Estimator API
Example algorithm: decision tree classifier
Model validation
Cross validation
Validation curves
Devise a set of rules (an algorithm) that are used to accomplish a task
For example, labeling e-mails as either "spam" or "not spam"
def spam_filter(email):
    """Function that labels an email as 'spam' or 'not spam'
    """
    if 'Act now!' in email.contents:
        label = 'spam'
    elif 'hotmail.com' in email.sender:
        label = 'spam'
    elif email.contents.count('$') > 20:
        label = 'spam'
    else:
        label = 'not spam'
    return label
"Field of study that gives computers the ability to learn without being explicitly programmed" — Arthur Samuel (1959)
"A machine-learning system is trained rather than explicitly programmed. It’s presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task." — Francois Chollet, Deep Learning with Python
From a labeled dataset, an algorithm learns a mapping between input data and the desired output label
Goal is to have model generalize well to future, yet unseen, data
Supervised machine learning is further divided into two types of problems:
Classification — Labels are discrete. E.g. determine if a picture is of a cat, dog, or person.
Regression — Labels are continuous. E.g. predict home prices.
plotting.plot_classification_vs_regression()
Popular Python machine learning library
Designed to be well documented and approachable for non-specialists
Built on top of NumPy and SciPy
scikit-learn can be easily installed with pip or conda
pip install scikit-learn
conda install scikit-learn
Training dataset is described by a pair of matrices, one for the input data and one for the output
The most commonly used data formats are a NumPy ndarray or a pandas DataFrame / Series
Each row of these matrices corresponds to one sample of the dataset
Each column represents a quantitative piece of information that is used to describe each sample (called "features")
plotting.plot_data_representation()
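As a minimal sketch of this layout, the toy arrays below (made-up values, not the dataset used in this talk) show the input matrix with one row per sample and one column per feature, alongside a label array with one entry per sample.

```python
import numpy as np

# Hypothetical toy dataset: 3 samples (rows), 2 features (columns)
X = np.array([[5.1, 3.5],
              [6.2, 2.9],
              [4.7, 3.2]])
# One label per sample, in the same order as the rows of X
y = np.array(['setosa', 'versicolor', 'setosa'])

n_samples, n_features = X.shape
print(f'n_samples = {n_samples}, n_features = {n_features}')
print(f'y.shape = {y.shape}')

# Row i of X and entry i of y describe the same sample
print(f'sample 1: features = {X[1]}, label = {y[1]}')
```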
Dataset consists of 150 samples (individual flowers) that have 4 features: sepal length, sepal width, petal length, and petal width (all in cm)
Each sample is labeled by its species: Iris Setosa, Iris Versicolor, or Iris Virginica
Task is to develop a model that predicts iris species
Iris dataset is freely available from the UCI Machine Learning Repository
import pandas as pd
iris = pd.read_csv('iris.csv')
iris = iris.sample(frac=1, random_state=2).reset_index(drop=True)
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 1 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 2 | 5.7 | 2.5 | 5.0 | 2.0 | virginica |
| 3 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
| 4 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
# Only include first two training features (sepal length and sepal width)
feature_columns = ['sepal_length', 'sepal_width']
X = iris[feature_columns].values
y = iris['species'].values
print(f'First 5 samples in X: \n{X[:5]}')
print(f'First 5 labels in y: \n{y[:5]}')
First 5 samples in X: 
[[4.6 3.4]
 [4.6 3.1]
 [5.7 2.5]
 [4.8 3. ]
 [4.8 3.4]]
First 5 labels in y: 
['setosa' 'setosa' 'virginica' 'setosa' 'setosa']
plotting.plot_2D_iris()
Algorithms are implemented as estimator classes in scikit-learn
Each estimator in scikit-learn is extensively documented (e.g. the KNeighborsClassifier documentation) with API documentation, user guides, and example usages.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.svm import SVC, SVR
from sklearn.linear_model import LinearRegression, LogisticRegression
model = KNeighborsClassifier(n_neighbors=5)
print(model)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform')
class Estimator(BaseClass):

    def __init__(self, **hyperparameters):
        # Set up Estimator here
        pass

    def fit(self, X, y):
        # Implement algorithm here
        return self

    def predict(self, X):
        # Get predicted target from trained model
        # Note: fit must be called before predict
        return y_pred
See API design for machine learning software: experiences from the scikit-learn project for a discussion of the API design choices for scikit-learn
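To make the template above concrete, here is a minimal sketch of a custom estimator. The `MajorityClassifier` class is hypothetical and written only for illustration; it builds on scikit-learn's `BaseEstimator` and `ClassifierMixin` base classes and always predicts the most common training label.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy estimator: always predicts the most common training label."""

    def __init__(self, verbose=False):
        # Hyperparameters are stored as attributes in __init__
        self.verbose = verbose

    def fit(self, X, y):
        # "Learn" from the data: find the most frequent label
        labels, counts = np.unique(y, return_counts=True)
        self.classes_ = labels
        self.majority_label_ = labels[np.argmax(counts)]
        if self.verbose:
            print(f'majority label = {self.majority_label_}')
        return self

    def predict(self, X):
        # Relies on majority_label_, which only exists after fit is called
        return np.full(len(X), self.majority_label_)


# Made-up data: the label 'b' is the most common
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(['a', 'b', 'b', 'b'])

model = MajorityClassifier().fit(X, y)
y_pred = model.predict(X)
print(y_pred)
# score is inherited from ClassifierMixin and returns the mean accuracy
print(model.score(X, y))
```

Because `fit` returns `self`, construction and training can be chained on a single line, as in the example.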
# Create the model
model = KNeighborsClassifier(n_neighbors=5)
# Fit the model
model.fit(X, y)
# Get model predictions
y_pred = model.predict(X)
y_pred[:10]
array(['setosa', 'setosa', 'versicolor', 'setosa', 'setosa', 'virginica', 'setosa', 'versicolor', 'virginica', 'setosa'], dtype=object)
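Since every estimator follows the same create / fit / predict pattern, swapping algorithms requires very little code. The sketch below assumes the copy of the iris data bundled with scikit-learn (`sklearn.datasets.load_iris`, which encodes species as the integers 0, 1, 2) rather than the `iris.csv` file used in this talk; the hyperparameter values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data, restricted to sepal length and sepal width
X, y = load_iris(return_X_y=True)
X = X[:, :2]

models = [KNeighborsClassifier(n_neighbors=5),
          DecisionTreeClassifier(max_depth=2, random_state=2),
          LogisticRegression(max_iter=1000)]

predictions = {}
for model in models:
    # Identical calls regardless of the underlying algorithm
    model.fit(X, y)
    name = type(model).__name__
    predictions[name] = model.predict(X)
    print(f'{name}: first 5 predictions = {predictions[name][:5]}')
```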
Idea behind the decision tree algorithm is to sequentially partition a training dataset by asking a series of questions.
Image source: Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
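One common way to decide which question to ask at each step is to pick the split that gives the lowest weighted Gini impurity (the `criterion='gini'` option of `DecisionTreeClassifier`). The sketch below is a simplified illustration on made-up data, not scikit-learn's implementation; the helper names are hypothetical.

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity of a set of labels: 1 - sum_k p_k**2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity after asking 'is feature <= threshold?'."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical toy data: one feature, two classes
feature = np.array([1.0, 1.5, 2.0, 4.0, 4.5, 5.0])
labels = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

print(f'impurity before splitting = {gini_impurity(labels)}')
for threshold in [1.5, 3.0, 4.5]:
    impurity = split_impurity(feature, labels, threshold)
    print(f'threshold = {threshold}: weighted impurity = {impurity:.3f}')
```

In this toy example the threshold 3.0 separates the two classes perfectly, so it gives a weighted impurity of zero and would be the preferred question.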
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
plotting.plot_decision_tree(clf)
plotting.plot_tree_decision_regions(clf)
There are many different performance metrics for classification and regression problems. Which metric you should use depends on the particular problem you are working on
Many commonly used performance metrics are built into the metrics subpackage in scikit-learn
Custom user-defined scoring functions can be created using the sklearn.metrics.make_scorer function
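As a rough sketch of how `make_scorer` can be used, the example below wraps a hypothetical `error_rate` metric so that it can be passed to scikit-learn's cross-validation tools. It assumes the bundled iris data from `sklearn.datasets.load_iris`; the metric and hyperparameter choices are for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def error_rate(y_true, y_pred):
    """Hypothetical custom metric: fraction of misclassified samples."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))


# greater_is_better=False marks this as a loss, so the scorer
# returns the negated error rate (higher is still "better")
error_scorer = make_scorer(error_rate, greater_is_better=False)

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=2)
scores = cross_val_score(clf, X, y, scoring=error_scorer, cv=5)
print(scores)
```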
# Classification metrics
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, log_loss)
# Regression metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_true = [0, 1, 1, 3, 2]
y_pred = [0, 2, 1, 3, 1]
accuracy_score(y_true, y_pred)
0.6
mean_squared_error(y_true, y_pred)
0.4
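For reference, both values above can be reproduced by hand. This is a plain-Python sketch of the two definitions, not the scikit-learn implementation.

```python
y_true = [0, 1, 1, 3, 2]
y_pred = [0, 2, 1, 3, 1]
n = len(y_true)

# Accuracy: fraction of predictions that exactly match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

# Mean squared error: average of the squared differences
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

print(f'accuracy = {accuracy}')  # 3 of 5 predictions are correct
print(f'mse = {mse}')            # (0 + 1 + 0 + 0 + 1) / 5
```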
A trained model will generally perform better on data that was used to train it
Want to measure how well a model generalizes to new, unseen data
Need to have two separate datasets. One for training models and one for evaluating model performance
scikit-learn has a convenient train_test_split function that randomly splits a dataset into a testing and training set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=2)
print(f'X.shape = {X.shape}')
print(f'X_test.shape = {X_test.shape}')
print(f'X_train.shape = {X_train.shape}')
X.shape = (150, 2)
X_test.shape = (30, 2)
X_train.shape = (120, 2)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(f'training accuracy = {accuracy_score(y_train, clf.predict(X_train))}')
print(f'testing accuracy = {accuracy_score(y_test, clf.predict(X_test))}')
training accuracy = 0.9333333333333333
testing accuracy = 0.7
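Conceptually, a random train / test split amounts to shuffling the sample indices and slicing off a held-out portion. The helper below is a hypothetical, simplified sketch of that idea using stand-in data; the real `train_test_split` supports more options (e.g. stratification) and may round or shuffle differently.

```python
import numpy as np


def simple_train_test_split(X, y, test_size=0.2, seed=2):
    """Hypothetical helper: shuffle sample indices, then slice off a test set."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Stand-in data with the same shape as the two-feature iris arrays
X = np.arange(300, dtype=float).reshape(150, 2)
y = np.arange(150) % 3

X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print(f'X_train.shape = {X_train.shape}')
print(f'X_test.shape = {X_test.shape}')
```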
Choose model hyperparameter values to avoid under- and over-fitting
Under-fitting: model isn't complex enough to properly model the dataset at hand
Over-fitting: model is too complex and begins to learn the noise in the training dataset
Image source: Underfitting vs. Overfitting in scikit-learn examples
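A small sketch of this effect, loosely in the spirit of the scikit-learn example cited above but using only NumPy: polynomials of increasing degree are fit to made-up noisy samples of a cosine. The data, noise level, and degrees are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.RandomState(2)


def true_function(x):
    return np.cos(1.5 * np.pi * x)


# Hypothetical noisy training and testing samples
x_train = np.sort(rng.uniform(0, 1, 30))
y_train = true_function(x_train) + rng.normal(scale=0.1, size=30)
x_test = np.sort(rng.uniform(0, 1, 30))
y_test = true_function(x_test) + rng.normal(scale=0.1, size=30)

train_mse, test_mse = {}, {}
for degree in [1, 4, 15]:
    # Model complexity is controlled by the polynomial degree
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = np.mean((model(x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((model(x_test) - y_test) ** 2)
    print(f'degree {degree:2d}: train MSE = {train_mse[degree]:.4f}, '
          f'test MSE = {test_mse[degree]:.4f}')
```

The degree-1 model is too simple to follow the curve, so its error is high on both sets. A very high degree drives the training error down, but that typically does not carry over to the testing error.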
Image source: Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
from sklearn.model_selection import cross_validate
clf = DecisionTreeClassifier(max_depth=2)
scores = cross_validate(clf, X_train, y_train,
                        scoring='accuracy', cv=10,
                        return_train_score=True)
print(scores.keys())
test_scores = scores['test_score']
train_scores = scores['train_score']
print(test_scores)
print(train_scores)
print('\n10-fold CV scores:')
print(f'training score = {np.mean(train_scores)} +/- {np.std(train_scores)}')
print(f'validation score = {np.mean(test_scores)} +/- {np.std(test_scores)}')
dict_keys(['fit_time', 'score_time', 'test_score', 'train_score'])
[0.84615385 0.76923077 0.75 0.58333333 0.91666667 0.66666667 0.91666667 0.83333333 0.63636364 0.72727273]
[0.76635514 0.77570093 0.77777778 0.7962963 0.75925926 0.78703704 0.75925926 0.76851852 0.78899083 0.74311927]

10-fold CV scores:
training score = 0.7722314314657621 +/- 0.015344020267747309
validation score = 0.7645687645687647 +/- 0.10869446623132276
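To illustrate what k-fold cross validation does with the data, here is a simplified sketch of how the sample indices can be divided into folds. The `k_fold_indices` helper is hypothetical; for classifiers, `cross_validate` with an integer `cv` uses stratified folds by default, which this sketch does not attempt to reproduce.

```python
import numpy as np


def k_fold_indices(n_samples, n_folds=10, seed=2):
    """Hypothetical helper: yield (train_idx, val_idx) pairs for k-fold CV."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, n_folds)
    for i in range(n_folds):
        # Fold i is held out for validation, the rest is used for training
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx


# 120 samples, matching the size of the training set used above
splits = list(k_fold_indices(n_samples=120, n_folds=10))
for train_idx, val_idx in splits[:3]:
    print(f'train size = {len(train_idx)}, validation size = {len(val_idx)}')
```

Each sample appears in exactly one validation fold, so every sample is used for both training and validation across the 10 iterations.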
Validation curves are a good way to diagnose if a model is under- or over-fitting
plotting.plot_validation_curve()
plotting.plot_max_depth_validation(clf, X_train, y_train)
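The numbers behind such a curve can be computed with scikit-learn's `validation_curve` function. The sketch below is not the code behind the `plotting` helpers above; it assumes the bundled iris data from `sklearn.datasets.load_iris` and an arbitrary range of `max_depth` values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data, restricted to sepal length and sepal width
X, y = load_iris(return_X_y=True)
X = X[:, :2]

max_depths = list(range(1, 11))
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=2), X, y,
    param_name='max_depth', param_range=max_depths,
    scoring='accuracy', cv=10)

# Each row holds the scores for one max_depth value across the 10 folds
for depth, train, val in zip(max_depths,
                             train_scores.mean(axis=1),
                             val_scores.mean(axis=1)):
    print(f'max_depth = {depth:2d}: training = {train:.3f}, validation = {val:.3f}')
```

A training score that keeps rising while the validation score levels off or drops is the usual sign of over-fitting.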
In practice, you'll want to optimize many different hyperparameter values simultaneously
The GridSearchCV object in scikit-learn's model_selection subpackage can be used to scan over many different hyperparameter combinations
Calculates cross-validated training and testing scores for each hyperparameter combination
The combination that maximizes the testing score is deemed to be the "best estimator"
from sklearn.model_selection import GridSearchCV
# Instantiate a model
clf = DecisionTreeClassifier()
# Specify hyperparameter values to test
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}
# Run grid search
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)
# Get best model
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')
print(gridsearch.best_estimator_)
gridsearch.best_params_ = {'criterion': 'gini', 'max_depth': 3}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
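Conceptually, a grid search is a loop over every hyperparameter combination, keeping the one with the best cross-validated score. The sketch below spells that loop out by hand; it is a simplified illustration, not how `GridSearchCV` is implemented (which also handles refitting, parallelism, and result bookkeeping), and it assumes the bundled iris data from `sklearn.datasets.load_iris`.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data, restricted to sepal length and sepal width
X, y = load_iris(return_X_y=True)
X = X[:, :2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=2)

max_depths = range(1, 20)
criteria = ['gini', 'entropy']

best_score, best_params = -1.0, None
for criterion, max_depth in product(criteria, max_depths):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth,
                                 random_state=2)
    # Mean validation accuracy over 10 folds for this combination
    score = cross_val_score(clf, X_train, y_train,
                            scoring='accuracy', cv=10).mean()
    if score > best_score:
        best_score = score
        best_params = {'criterion': criterion, 'max_depth': max_depth}

print(f'best_params = {best_params}')
print(f'best cross-validated accuracy = {best_score:.3f}')
```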
Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=2)
clf = DecisionTreeClassifier()
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')
best_clf = gridsearch.best_estimator_
best_clf
gridsearch.best_params_ = {'criterion': 'gini', 'max_depth': 3}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
y_pred = best_clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f'test_acc = {test_acc}')
test_acc = 0.8222222222222222
final_model = DecisionTreeClassifier(**gridsearch.best_params_)
final_model.fit(X, y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
# Step 1: Get training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=2)
# Step 2: Use GridSearchCV to find optimal hyperparameter values
clf = DecisionTreeClassifier(random_state=2)
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')
# Step 3: Get model with best hyperparameters
best_clf = gridsearch.best_estimator_
# Step 4: Get best model performance from testing set
y_pred = best_clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f'test_acc = {test_acc}')
# Step 5: Train final model on full dataset
final_model = DecisionTreeClassifier(random_state=2, **gridsearch.best_params_)
final_model.fit(X, y);
gridsearch.best_params_ = {'criterion': 'gini', 'max_depth': 3}
test_acc = 0.8222222222222222