https://github.com/jrbourbeau/madpy-ml-sklearn-2018

https://jrbourbeau.github.io/madpy-ml-sklearn-2018

E-mail: james@jamesbourbeau.com

GitHub: jrbourbeau

LinkedIn: jrbourbeau

Source code for the `plotting` Python module can be found on GitHub with the rest of the materials for this talk

In [1]:

```
import plotting
import numpy as np
np.random.seed(2)
%matplotlib inline
```

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

What is machine learning?

Classical programming vs. machine learning

Supervised machine learning

scikit-learn:

Data representation

Estimator API

Example algorithm: decision tree classifier

Model validation

Cross validation

Validation curves

Devise a set of rules (an algorithm) that are used to accomplish a task

For example, labeling e-mails as either "spam" or "not spam"

In [2]:

```
def spam_filter(email):
    """Function that labels an email as 'spam' or 'not spam'
    """
    if 'Act now!' in email.contents:
        label = 'spam'
    elif 'hotmail.com' in email.sender:
        label = 'spam'
    elif email.contents.count('$') > 20:
        label = 'spam'
    else:
        label = 'not spam'

    return label
```
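To make the rule-based approach concrete, here is a minimal sketch that runs the same rules on two made-up messages. The `Email` container and both example messages are assumptions for illustration; the filter only needs an object with `contents` and `sender` attributes.

```python
from collections import namedtuple

# Hypothetical container; the filter only needs .contents and .sender
Email = namedtuple('Email', ['sender', 'contents'])


def spam_filter(email):
    """Label an email as 'spam' or 'not spam' using hand-written rules."""
    if 'Act now!' in email.contents:
        return 'spam'
    elif 'hotmail.com' in email.sender:
        return 'spam'
    elif email.contents.count('$') > 20:
        return 'spam'
    return 'not spam'


# Made-up example messages
promo = Email(sender='deals@example.com', contents='Act now! Limited offer.')
note = Email(sender='colleague@example.com', contents='Meeting moved to 3pm.')

labels = [spam_filter(promo), spam_filter(note)]
print(labels)
```

Every rule here was written by hand; any new kind of spam requires a person to add another `elif` branch.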

"Field of study that gives computers the ability to learn without being explicitly programmed" — Arthur Samuel (1959)

"A machine-learning system is trained rather than explicitly programmed. It's presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task." — Francois Chollet, *Deep Learning with Python*

From a labeled dataset, an algorithm learns a mapping between input data and the desired output label

Goal is to have the model generalize well to future, as yet unseen, data

Supervised machine learning is further divided into two types of problems:

Classification — Labels are discrete. E.g. determine if a picture is of a cat, dog, or person.

Regression — Labels are continuous. E.g. predict home prices.

In [3]:

```
plotting.plot_classification_vs_regression()
```

Popular Python machine learning library

Designed to be well documented and approachable for non-specialists

Built on top of NumPy and SciPy

scikit-learn can be easily installed with `pip` or `conda`

`pip install scikit-learn`

`conda install scikit-learn`

API design for machine learning software: experiences from the scikit-learn project — for a discussion of the API design choices for scikit-learn

Training dataset is described by a pair of matrices, one for the input data and one for the output

Most commonly used data formats are a NumPy `ndarray` or a Pandas `DataFrame` / `Series`

Each row of these matrices corresponds to one sample of the dataset

Each column represents a quantitative piece of information that is used to describe each sample (called "features")
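As a small illustration of this layout, the sketch below builds a feature matrix and a label vector by hand with NumPy. The names `X_toy` and `y_toy` and the values in them are toy assumptions, chosen only to show the shapes involved.

```python
import numpy as np

# Toy feature matrix: 5 samples (rows) x 4 features (columns)
X_toy = np.array([[5.1, 3.5, 1.4, 0.2],
                  [4.9, 3.0, 1.4, 0.2],
                  [6.3, 3.3, 6.0, 2.5],
                  [5.8, 2.7, 5.1, 1.9],
                  [7.0, 3.2, 4.7, 1.4]])

# Toy label vector: one entry per sample
y_toy = np.array([0, 0, 2, 2, 1])

n_samples, n_features = X_toy.shape
print(f'n_samples = {n_samples}, n_features = {n_features}')

# A row is one sample; a column is one feature across all samples
first_sample = X_toy[0]
second_feature = X_toy[:, 1]
print(f'first sample = {first_sample}')
print(f'second feature = {second_feature}')
```

The number of rows in the feature matrix must match the number of labels.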

In [4]:

```
plotting.plot_data_representation()
```

Dataset consists of 150 samples (individual flowers) that have 4 features: sepal length, sepal width, petal length, and petal width (all in cm)

Each sample is labeled by its species: Iris Setosa, Iris Versicolour, Iris Virginica

Task is to develop a model that predicts iris species

Iris dataset is freely available from the UCI Machine Learning Repository

In [5]:

```
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Only include first two training features (sepal length and sepal width)
X = X[:, :2]
print(f'First 5 samples in X: \n{X[:5]}')
print(f'Labels: \n{y}')
```

In [6]:

```
plotting.plot_2D_iris()
```

Algorithms are implemented as estimator classes in scikit-learn

Each estimator in scikit-learn is extensively documented (e.g. the KNeighborsClassifier documentation) with API documentation, user guides, and example usages.

In [7]:

```
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.svm import SVC, SVR
from sklearn.linear_model import LinearRegression, LogisticRegression
```

- A model is an instance of one of these estimator classes

In [8]:

```
model = KNeighborsClassifier(n_neighbors=5)
print(model)
```

```
class Estimator(BaseClass):

    def __init__(self, **hyperparameters):
        # Setup Estimator here
        ...

    def fit(self, X, y):
        # Implement algorithm here
        return self

    def predict(self, X):
        # Get predicted target from trained model
        # Note: fit must be called before predict
        return y_pred
```

In [9]:

```
# Create the model
model = KNeighborsClassifier(n_neighbors=5)
# Fit the model
model.fit(X, y)
# Get model predictions
y_pred = model.predict(X)
y_pred
```

Out[9]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1])
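The predictions above are for the same samples the model was trained on. As a sketch of how a fitted model is applied to a new observation, the example below predicts the class of one made-up flower. The measurements in `X_new` are hypothetical, and `predict_proba` is an additional estimator method not used elsewhere in these slides.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Same setup as above: first two iris features only
X, y = load_iris(return_X_y=True)
X = X[:, :2]

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)

# Hypothetical new flower: sepal length 5.0 cm, sepal width 3.5 cm
# Note the 2D shape (1 sample, 2 features), matching the training data layout
X_new = np.array([[5.0, 3.5]])

new_label = model.predict(X_new)        # predicted class for each row
new_proba = model.predict_proba(X_new)  # per-class probability for each row
print(f'predicted label = {new_label}')
print(f'class probabilities = {new_proba}')
```

Input passed to `predict` needs the same number of columns as the data passed to `fit`.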

Idea behind the decision tree algorithm is to sequentially partition a training dataset by asking a series of questions.

Image source: Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
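One common way to pick each question is to choose the threshold that leaves the resulting groups as pure as possible. The following is a minimal sketch of that idea for a single feature, using Gini impurity (the default `criterion` of `DecisionTreeClassifier`). The helper names `gini` and `best_threshold` and the toy data are assumptions for illustration; this is not scikit-learn's actual implementation.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a set of class labels (0 means perfectly pure)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)


def best_threshold(feature, labels):
    """Find the split 'feature <= t' with the lowest weighted impurity."""
    best_t, best_impurity = None, np.inf
    values = np.unique(feature)
    # Candidate thresholds: midpoints between consecutive distinct values
    for t in (values[:-1] + values[1:]) / 2:
        left = labels[feature <= t]
        right = labels[feature > t]
        weighted = (len(left) * gini(left)
                    + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_t, best_impurity = t, weighted
    return best_t, best_impurity


# Toy data: small feature values belong to class 0, large ones to class 1
feature = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labels = np.array([0, 0, 0, 1, 1, 1])

threshold, impurity = best_threshold(feature, labels)
print(f'best question: is feature <= {threshold}? (impurity = {impurity})')
```

A full tree repeats this search over every feature, then recurses into each of the two resulting groups until a stopping rule such as `max_depth` is reached.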

In [10]:

```
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, y)
```

Out[10]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')

In [11]:

```
plotting.plot_decision_tree(clf)
```

In [12]:

```
plotting.plot_tree_decision_regions(clf)
```

There are many different performance metrics for classification and regression problems. Which metric you should use depends on the particular problem you are working on

Many commonly used performance metrics are built into the `metrics` subpackage in scikit-learn

However, a user-defined scoring function can be created using the `sklearn.metrics.make_scorer` function
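As a sketch of how `make_scorer` might be used, the example below wraps a custom metric and passes it to `cross_val_score`. The metric `error_rate` is a hypothetical example, and the choice of model and `cv=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def error_rate(y_true, y_pred):
    """Hypothetical custom metric: fraction of misclassified samples."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))


# Lower is better for an error, so ask scikit-learn to negate the value.
# This keeps the "higher score is better" convention used by model selection.
error_scorer = make_scorer(error_rate, greater_is_better=False)

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=2)

cv_scores = cross_val_score(clf, X, y, scoring=error_scorer, cv=5)
print(cv_scores)  # negated error rates, one per fold
```

Because of the sign flip, the reported scores are zero or negative; multiply by -1 to recover the error rate.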

In [13]:

```
# Classification metrics
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, log_loss)
# Regression metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
```

In [14]:

```
y_pred = [0, 2, 1, 3, 1]
y_true = [0, 1, 1, 3, 2]
```

In [15]:

```
accuracy_score(y_true, y_pred)
```

Out[15]:

0.6

In [16]:

```
mean_squared_error(y_true, y_pred)
```

Out[16]:

0.4
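Both numbers can be reproduced by hand, which helps clarify what each metric measures. A minimal NumPy sketch using the same `y_pred` and `y_true` values:

```python
import numpy as np

y_pred = np.array([0, 2, 1, 3, 1])
y_true = np.array([0, 1, 1, 3, 2])

# Accuracy: fraction of predictions that match the true label (3 of 5)
accuracy = np.mean(y_pred == y_true)

# Mean squared error: average squared difference, (0 + 1 + 0 + 0 + 1) / 5
mse = np.mean((y_true - y_pred) ** 2)

print(f'accuracy = {accuracy}')
print(f'mean squared error = {mse}')
```

Accuracy treats every wrong label equally, while mean squared error penalizes predictions more the further they are from the true value, which is why it is normally used for regression.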

A trained model will generally perform better on data that was used to train it

Want to measure how well a model generalizes to new, unseen data

Need to have two separate datasets. One for training models and one for evaluating model performance

scikit-learn has a convenient `train_test_split` function that randomly splits a dataset into a training set and a testing set

In [17]:

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=2)
print(f'X.shape = {X.shape}')
print(f'X_test.shape = {X_test.shape}')
print(f'X_train.shape = {X_train.shape}')
```

X.shape = (150, 2)

X_test.shape = (30, 2)

X_train.shape = (120, 2)
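Conceptually, a random split shuffles the sample indices and sets aside a fraction of them for testing. The sketch below illustrates the idea with NumPy only; it is an assumption-laden simplification and not how `train_test_split` is implemented internally.

```python
import numpy as np

rng = np.random.RandomState(2)
n_samples = 150
test_size = 0.2

# Shuffle the sample indices, then carve off the first 20% for testing
indices = rng.permutation(n_samples)
n_test = int(round(test_size * n_samples))
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(f'number of testing samples = {len(test_idx)}')
print(f'number of training samples = {len(train_idx)}')

# The two index sets can then select rows, e.g. X[train_idx], y[train_idx]
```

The key property is that no sample appears in both sets, so the testing set stays unseen during training.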

In [18]:

```
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(f'training accuracy = {accuracy_score(y_train, clf.predict(X_train))}')
print(f'testing accuracy = {accuracy_score(y_test, clf.predict(X_test))}')
```

training accuracy = 0.9333333333333333

testing accuracy = 0.6666666666666666

Choose model hyperparameter values to avoid under- and over-fitting

Under-fitting — model isn't complex enough to capture the structure of the dataset at hand

Over-fitting — model is too complex and begins to learn the noise in the training dataset
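One way to see both effects is to compare training and testing accuracy as model complexity changes. The sketch below does this for a decision tree at a few `max_depth` values; the particular depths tried and the fixed `random_state` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=2)

results = {}
# max_depth=None lets the tree grow until its leaves are pure
for max_depth in [1, 3, None]:
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=2)
    clf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    test_acc = accuracy_score(y_test, clf.predict(X_test))
    results[max_depth] = (train_acc, test_acc)
    print(f'max_depth = {max_depth}: '
          f'training accuracy = {train_acc:.3f}, '
          f'testing accuracy = {test_acc:.3f}')
```

A low score on both sets suggests under-fitting, while a large gap between a high training score and a lower testing score suggests over-fitting.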

Image source: Underfitting vs. Overfitting in scikit-learn examples

Image source: Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.

In [19]:

```
from sklearn.model_selection import cross_validate
clf = DecisionTreeClassifier(max_depth=2)
scores = cross_validate(clf, X_train, y_train,
                        scoring='accuracy', cv=10,
                        return_train_score=True)
print(scores.keys())
test_scores = scores['test_score']
train_scores = scores['train_score']
print(test_scores)
print(train_scores)
print('\n10-fold CV scores:')
print(f'training score = {np.mean(train_scores)} +/- {np.std(train_scores)}')
print(f'validation score = {np.mean(test_scores)} +/- {np.std(test_scores)}')
```
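To show what happens inside `cross_validate`, here is a sketch of the same 10-fold procedure written as an explicit loop. It uses `StratifiedKFold`, which is what scikit-learn selects for a classifier when `cv` is an integer; fitting on the full two-feature iris data rather than `X_train` is a simplification made so the block is self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]

skf = StratifiedKFold(n_splits=10)
val_scores = []
for train_idx, val_idx in skf.split(X, y):
    # Train on 9 folds, evaluate on the 1 held-out fold
    clf = DecisionTreeClassifier(max_depth=2, random_state=2)
    clf.fit(X[train_idx], y[train_idx])
    val_scores.append(clf.score(X[val_idx], y[val_idx]))

val_scores = np.array(val_scores)
print(f'validation score = {val_scores.mean():.3f} +/- {val_scores.std():.3f}')
```

Each sample is used for validation exactly once, so the spread of the ten scores gives a rough sense of how much the estimate depends on the particular split.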

Validation curves are a good way to diagnose if a model is under- or over-fitting

In [20]:

```
plotting.plot_validation_curve()
```

In [21]:

```
plotting.plot_max_depth_validation(clf, X_train, y_train)
```
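The numbers behind such a curve can be computed with scikit-learn's `validation_curve` function. The sketch below is one way to do so for `max_depth`; it is not necessarily what the `plotting` helper does, and the depth range and `cv=5` are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]

depths = np.arange(1, 11)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=2), X, y,
    param_name='max_depth', param_range=depths,
    scoring='accuracy', cv=5)

# Each result has one row per max_depth value and one column per CV fold
for depth, train_row, val_row in zip(depths, train_scores, val_scores):
    print(f'max_depth = {depth}: '
          f'training = {train_row.mean():.3f}, '
          f'validation = {val_row.mean():.3f}')
```

Plotting the two mean scores against `max_depth` gives the validation curve: the region where the training score keeps rising while the validation score flattens or drops indicates over-fitting.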

In practice, you'll want to optimize many different hyperparameter values simultaneously

The `GridSearchCV` object in scikit-learn's `model_selection` subpackage can be used to scan over many different hyperparameter combinations

Calculates cross-validated training and testing scores for each hyperparameter combination

The combination that maximizes the mean cross-validated testing score is deemed to be the "best estimator"

In [22]:

```
from sklearn.model_selection import GridSearchCV
# Instantiate a model
clf = DecisionTreeClassifier()
# Specify hyperparameter values to test
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}
# Run grid search
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)
# Get best model
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')
print(gridsearch.best_estimator_)
```
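Beyond the best parameters, the fitted object also records the score of every combination it tried in its `cv_results_` attribute. The sketch below inspects those results; it uses a smaller grid and `cv=5` than the cell above (assumptions made to keep the example quick).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=2)

# Smaller grid than in the slides: 5 depths x 2 criteria = 10 combinations
parameters = {'max_depth': range(1, 6),
              'criterion': ['gini', 'entropy']}
gridsearch = GridSearchCV(DecisionTreeClassifier(random_state=2), parameters,
                          scoring='accuracy', cv=5)
gridsearch.fit(X_train, y_train)

# One entry per hyperparameter combination
results = gridsearch.cv_results_
for params, mean_score in zip(results['params'], results['mean_test_score']):
    print(f'{params}: mean CV accuracy = {mean_score:.3f}')

print(f'best mean CV accuracy = {gridsearch.best_score_:.3f}')
```

Looking at the full table is often more informative than the single best entry, since several combinations may score almost identically.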

Image source: Model evaluation, model selection, and algorithm selection in machine learning by Sebastian Raschka

In [23]:

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=2)
```

In [24]:

```
clf = DecisionTreeClassifier()
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')
best_clf = gridsearch.best_estimator_
best_clf
```

gridsearch.best_params_ = {'criterion': 'gini', 'max_depth': 3}

Out[24]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')

In [25]:

```
y_pred = best_clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f'test_acc = {test_acc}')
```

test_acc = 0.8

In [26]:

```
final_model = DecisionTreeClassifier(**gridsearch.best_params_)
final_model.fit(X, y)
```

In [27]:

```
# Step 1: Get training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=2)
# Step 2: Use GridSearchCV to find optimal hyperparameter values
clf = DecisionTreeClassifier(random_state=2)
parameters = {'max_depth': range(1, 20),
              'criterion': ['gini', 'entropy']}
gridsearch = GridSearchCV(clf, parameters, scoring='accuracy', cv=10)
gridsearch.fit(X_train, y_train)
print(f'gridsearch.best_params_ = {gridsearch.best_params_}')
# Step 3: Get model with best hyperparameters
best_clf = gridsearch.best_estimator_
# Step 4: Get best model performance from testing set
y_pred = best_clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f'test_acc = {test_acc}')
# Step 5: Train final model on full dataset
final_model = DecisionTreeClassifier(random_state=2, **gridsearch.best_params_)
final_model.fit(X, y);
```

gridsearch.best_params_ = {'criterion': 'gini', 'max_depth': 3}

test_acc = 0.8