Decision trees in python again, cross-validation

warning

This post is more than 5 years old. While math doesn't age, code and operating systems do. Please use the code/ideas with caution and expect some issues due to the age of the content. I am keeping these posts up for archival purposes because I still find them useful for reference, even when they are out of date!

This is my second post on decision trees using scikit-learn and Python. The first, Decision trees in python with scikit-learn and pandas, focused on visualizing the resulting tree. This post will concentrate on using cross-validation methods to choose the parameters used to train the tree. In particular, I'll focus on grid and random search for decision tree parameter settings. If this sounds interesting to you, follow along. As always, comments, questions and corrections are welcome below.

gist

Okay, let's get started with the gist containing all of the needed code-- linked here-- so you don't have to copy and paste (unless you want to). The following code is adapted from this scikit-learn example (be careful-- although the example looks like it uses the iris data, it really loads the digits data). Also, that example uses a random forest-- an ensemble of decision trees-- so the parameters that can be searched are slightly different.

imports

First we import all of the code we'll need for the post-- this expands on the methods needed in the last post:

from __future__ import print_function

import os
import subprocess

from time import time
from operator import itemgetter
from scipy.stats import randint

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.grid_search import GridSearchCV
from sklearn.grid_search import RandomizedSearchCV
from sklearn.cross_validation import cross_val_score

The main additions are methods from sklearn.grid_search as well as some tools to:

  • time the searches, using time,
  • sort the results, using itemgetter, and
  • generate random integers, using scipy.stats.randint.

Now we can start writing our functions-- some new, some old.

previous functions

I will also declare functions used in the previous post so that I can use them here. These include:

  • get_code -- writes pseudo-code for a decision tree,
  • visualize_tree -- generates a graphic of a decision tree (the ability to name the output files has been added here),
  • encode_target -- processes raw data for use with scikit-learn, and
  • get_iris_data -- grabs iris.csv from the web, if needed, and writes a copy to the local directory. This is mainly to replicate real-world usage of pandas and scikit-learn.

def get_code(tree, feature_names, target_names,
             spacer_base=" "):
    """Produce pseudo-code for decision tree.

    Args
    ----
    tree -- scikit-learn Decision Tree.
    feature_names -- list of feature names.
    target_names -- list of target (class) names.
    spacer_base -- used for spacing code (default: " ").

    Notes
    -----
    based on http://stackoverflow.com/a/30104792.
    """
    left = tree.tree_.children_left
    right = tree.tree_.children_right
    threshold = tree.tree_.threshold
    features = [feature_names[i] for i in tree.tree_.feature]
    value = tree.tree_.value

    def recurse(left, right, threshold, features, node, depth):
        spacer = spacer_base * depth
        if (threshold[node] != -2):
            print(spacer + "if ( " + features[node] + " <= " +
                  str(threshold[node]) + " ) {")
            if left[node] != -1:
                recurse(left, right, threshold, features,
                        left[node], depth+1)
            print(spacer + "}\n" + spacer + "else {")
            if right[node] != -1:
                recurse(left, right, threshold, features,
                        right[node], depth+1)
            print(spacer + "}")
        else:
            target = value[node]
            for i, v in zip(np.nonzero(target)[1],
                            target[np.nonzero(target)]):
                target_name = target_names[i]
                target_count = int(v)
                print(spacer + "return " + str(target_name) +
                      " ( " + str(target_count) + " examples )")

    recurse(left, right, threshold, features, 0, 0)


def visualize_tree(tree, feature_names, fn="dt"):
    """Create tree png using graphviz.

    Args
    ----
    tree -- scikit-learn Decision Tree.
    feature_names -- list of feature names.
    fn -- [string], root of filename, default `dt`.
    """
    dotfile = fn + ".dot"
    pngfile = fn + ".png"

    with open(dotfile, 'w') as f:
        export_graphviz(tree, out_file=f,
                        feature_names=feature_names)

    command = ["dot", "-Tpng", dotfile, "-o", pngfile]
    try:
        subprocess.check_call(command)
    except:
        exit("Could not run dot, ie graphviz, "
             "to produce visualization")


def encode_target(df, target_column):
    """Add column to df with integers for the target.

    Args
    ----
    df -- pandas Data Frame.
    target_column -- column to map to int, producing new
                     Target column.

    Returns
    -------
    df -- modified Data Frame.
    targets -- list of target names.
    """
    df_mod = df.copy()
    targets = df_mod[target_column].unique()
    map_to_int = {name: n for n, name in enumerate(targets)}
    df_mod["Target"] = df_mod[target_column].replace(map_to_int)

    return (df_mod, targets)


def get_iris_data():
    """Get the iris data, from local csv or pandas repo."""
    if os.path.exists("iris.csv"):
        print("-- iris.csv found locally")
        df = pd.read_csv("iris.csv", index_col=0)
    else:
        print("-- trying to download from github")
        fn = ("https://raw.githubusercontent.com/pydata/"
              "pandas/master/pandas/tests/data/iris.csv")
        try:
            df = pd.read_csv(fn)
        except:
            exit("-- Unable to download iris.csv")

        with open("iris.csv", 'w') as f:
            print("-- writing to local iris.csv file")
            df.to_csv(f)

    return df

new functions

report

Next we add some new functions to do the grid and random searches, as well as to report on the top parameters found. First up is report. This function takes the output from the grid or random search, prints a report of the top models, and returns the best parameter setting.

def report(grid_scores, n_top=3):
    """Report top n_top parameter settings, default n_top=3.

    Args
    ----
    grid_scores -- output from grid or random search
    n_top -- how many of the top models to report

    Returns
    -------
    top_params -- [dict] top parameter settings found in
                  search
    """
    top_scores = sorted(grid_scores,
                        key=itemgetter(1),
                        reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print(("Mean validation score: "
               "{0:.3f} (std: {1:.3f})").format(
               score.mean_validation_score,
               np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")

    return top_scores[0].parameters

grid search

Next up is run_gridsearch. This function takes

  • the features X,
  • the targets y,
  • a (Decision Tree) classifier clf,
  • a dictionary of parameters to try, param_grid, and
  • the number of cross-validation folds cv, defaulting to 5-- this is discussed more below.

The param_grid is the set of parameters that will be tested-- be careful not to list too many options, because all combinations will be tested!
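As a rough sanity check before launching a grid search, you can count how many candidate settings the grid will generate-- it is simply the product of the number of options per parameter. Here is a minimal sketch; the lists are the same ones used in param_grid later in this post:

import numpy as np

# number of grid-search candidates = product of the options per parameter
param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_split": [2, 10, 20],
              "max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],
              "max_leaf_nodes": [None, 5, 10, 20]}

n_candidates = np.prod([len(v) for v in param_grid.values()])
print(n_candidates)  # 288 -- and each candidate is fit cv times

With that in mind, here is run_gridsearch: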

def run_gridsearch(X, y, clf, param_grid, cv=5):
    """Run a grid search for best Decision Tree parameters.

    Args
    ----
    X -- features
    y -- targets (classes)
    clf -- scikit-learn Decision Tree
    param_grid -- [dict] parameter settings to test
    cv -- fold of cross-validation, default 5

    Returns
    -------
    top_params -- [dict] from report()
    """
    grid_search = GridSearchCV(clf,
                               param_grid=param_grid,
                               cv=cv)
    start = time()
    grid_search.fit(X, y)

    print(("\nGridSearchCV took {:.2f} "
           "seconds for {:d} candidate "
           "parameter settings.").format(time() - start,
                                         len(grid_search.grid_scores_)))

    top_params = report(grid_search.grid_scores_, 3)
    return top_params

random search

Next up is the function run_randomsearch, which samples parameters from specified lists or distributions. Similar to the grid search, the arguments are:

  • the features X,
  • the targets y,
  • a (Decision Tree) classifier clf,
  • the number of cross-validation folds cv, defaulting to 5-- this is discussed more below, and
  • the number of random parameter settings to consider, n_iter_search, defaulting to 20.

def run_randomsearch(X, y, clf, param_dist, cv=5,
                     n_iter_search=20):
    """Run a random search for best Decision Tree parameters.

    Args
    ----
    X -- features
    y -- targets (classes)
    clf -- scikit-learn Decision Tree
    param_dist -- [dict] lists/distributions of parameters
                  to sample
    cv -- fold of cross-validation, default 5
    n_iter_search -- number of random parameter settings to try,
                     default 20.

    Returns
    -------
    top_params -- [dict] from report()
    """
    random_search = RandomizedSearchCV(clf,
                                       param_distributions=param_dist,
                                       cv=cv,
                                       n_iter=n_iter_search)

    start = time()
    random_search.fit(X, y)
    print(("\nRandomizedSearchCV took {:.2f} seconds "
           "for {:d} candidate parameter "
           "settings.").format((time() - start),
                               n_iter_search))

    top_params = report(random_search.grid_scores_, 3)
    return top_params

Okay, we've defined all our functions-- let's use them!

cross-validation

getting the data

Next, let's run through finding good parameter settings, using the search methods that we've setup above. First some preliminaries-- get the data and construct the feature and target data:

print("\n-- get data:")
df = get_iris_data()
print("")

features = ["SepalLength", "SepalWidth",
"PetalLength", "PetalWidth"]
df, targets = encode_target(df, "Name")
y = df["Target"]
X = df[features]

producing...

-- get data:
-- iris.csv found locally

a first cross-validation

Next, let's do cross-validation using the parameters from the previous post-- Decision trees in python with scikit-learn and pandas. I'll use 10-fold cross-validation in all of the examples to follow. This choice means:

  • split the data into 10 parts,
  • fit the model on 9 of the parts, and
  • test its accuracy on the remaining part.

This is repeated, holding out each of the 10 parts in turn, to produce ten estimates of the accuracy of the model with the current parameter settings. Typically the mean and standard deviation of the ten scores are reported (a quick by-hand sketch of this procedure follows the results below). So, if we use the settings from the previous post, we get:

print("-- 10-fold cross-validation "
"[using setup from previous post]")
dt_old = DecisionTreeClassifier(min_samples_split=20,
random_state=99)
dt_old.fit(X, y)
scores = cross_val_score(dt_old, X, y, cv=10)
print("mean: {:.3f} (std: {:.3f})".format(scores.mean(),
scores.std()),
end="\n\n" )

producing...

-- 10-fold cross-validation [using setup from previous post]
mean: 0.960 (std: 0.033)

Okay, 0.960 is not bad. That means that the average accuracy (the percentage of correct classifications using the trained model) is 96%. That accuracy is already pretty high, but let's see if better parameters can be found.
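As promised, here is a minimal by-hand sketch of what that 10-fold cross-validation is doing, using the X, y and imports defined above plus KFold from the same (old) sklearn.cross_validation module. Note that for classifiers cross_val_score actually uses stratified folds, so the numbers will be close to, but not necessarily identical to, the result above:

from sklearn.cross_validation import KFold

# 10 folds: each pass trains on 9 parts and scores the held-out part
kf = KFold(len(X), n_folds=10, shuffle=True, random_state=99)
fold_scores = []
for train_idx, test_idx in kf:
    dt_fold = DecisionTreeClassifier(min_samples_split=20,
                                     random_state=99)
    dt_fold.fit(X.iloc[train_idx], y.iloc[train_idx])
    fold_scores.append(dt_fold.score(X.iloc[test_idx],
                                     y.iloc[test_idx]))

print("mean: {:.3f} (std: {:.3f})".format(np.mean(fold_scores),
                                          np.std(fold_scores)))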

application of grid search

First, I'll try a grid search. The dictionary param_grid provides the different parameter settings to test. The keys are the parameter names and the values are lists of settings to try. Make sure to check the documentation for Decision Trees (or whatever method you are testing) to get the correct parameter names and valid values to test.

print("-- Grid Parameter Search via 10-fold CV")

# set of parameters to test
param_grid = {"criterion": ["gini", "entropy"],
"min_samples_split": [2, 10, 20],
"max_depth": [None, 2, 5, 10],
"min_samples_leaf": [1, 5, 10],
"max_leaf_nodes": [None, 5, 10, 20],
}

dt = DecisionTreeClassifier()
ts_gs = run_gridsearch(X, y, dt, param_grid, cv=10)

producing...

-- Grid Parameter Search via 10-fold CV

GridSearchCV took 5.02 seconds for 288 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.967 (std: 0.033)
Parameters: {'min_samples_split': 10, 'max_leaf_nodes': 5,
'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1}

Model with rank: 2
Mean validation score: 0.967 (std: 0.033)
Parameters: {'min_samples_split': 20, 'max_leaf_nodes': 5,
'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1}

Model with rank: 3
Mean validation score: 0.967 (std: 0.033)
Parameters: {'min_samples_split': 10, 'max_leaf_nodes': 5,
'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1}

In most runs I get a mean of 0.967 for a variety of parameter settings. This means there is an improvement from 96% to 96.7%-- I guess every bit helps! We can see the best parameter setting, ts_gs, as shown below:

print("\n-- Best Parameters:")
for k, v in ts_gs.items():
    print("parameter: {:<20s} setting: {}".format(k, v))

producing...

-- Best Parameters:
parameter: min_samples_split setting: 10
parameter: max_leaf_nodes setting: 5
parameter: criterion setting: gini
parameter: max_depth setting: None
parameter: min_samples_leaf setting: 1

and, replicate the cross-validation results:

# test the returned best parameters
print("\n\n-- Testing best parameters [Grid]...")
dt_ts_gs = DecisionTreeClassifier(**ts_gs)
scores = cross_val_score(dt_ts_gs, X, y, cv=10)
print("mean: {:.3f} (std: {:.3f})".format(scores.mean(),
                                          scores.std()),
      end="\n\n")

producing...

-- Testing best parameters [Grid]...
mean: 0.967 (std: 0.033)

Next, let's use the code from the previous post (also provided above) to get pseudo-code for the best tree:

print("\n-- get_code for best parameters [Grid]:", end="\n\n")
dt_ts_gs.fit(X,y)
get_code(dt_ts_gs, features, targets)

producing...

-- get_code for best parameters [Grid]:

if ( PetalWidth <= 0.800000011921 ) {
 return Iris-setosa ( 50 examples )
}
else {
 if ( PetalWidth <= 1.75 ) {
  if ( PetalLength <= 4.94999980927 ) {
   if ( PetalWidth <= 1.65000009537 ) {
    return Iris-versicolor ( 47 examples )
   }
   else {
    return Iris-virginica ( 1 examples )
   }
  }
  else {
   return Iris-versicolor ( 2 examples )
   return Iris-virginica ( 4 examples )
  }
 }
 else {
  return Iris-versicolor ( 1 examples )
  return Iris-virginica ( 45 examples )
 }
}

We can also make a graphic of the Decision Tree:

visualize_tree(dt_ts_gs, features, fn="grid_best")

resulting in

Image of the best decision tree using grid search

application of random search

Next, we try the random search method for finding parameters. In this case the dictionary param_dist has keys that are the parameter names (as before) and values that are either (i) a list to be sampled from, or (ii) a distribution to be sampled from-- again, make sure the parameter names are valid and that the distributions produce values that are sensible for the method being tested. A quick look at what sampling from one of these distributions produces is shown next.
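Just to see what RandomizedSearchCV will be drawing from, here is a minimal check using scipy.stats.randint (imported at the top of the post); randint(low, high) yields integers from low up to high - 1, and the exact values will of course vary from run to run:

from scipy.stats import randint

# sample five values for max_depth the way the random search would
max_depth_dist = randint(1, 20)
print(max_depth_dist.rvs(5))  # e.g. [ 3 11  7 19  2] -- varies per run

In this example I use 288 samples so that the number of parameter settings tested is the same as in the grid search above: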

print("-- Random Parameter Search via 10-fold CV")

# dict of parameter list/distributions to sample
param_dist = {"criterion": ["gini", "entropy"],
"min_samples_split": randint(1, 20),
"max_depth": randint(1, 20),
"min_samples_leaf": randint(1, 20),
"max_leaf_nodes": randint(2, 20)}

dt = DecisionTreeClassifier()
ts_rs = run_randomsearch(X, y, dt, param_dist, cv=10,
n_iter_search=288)

producing...

-- Random Parameter Search via 10-fold CV

RandomizedSearchCV took 1.52 seconds for 288 candidate parameter
settings.
Model with rank: 1
Mean validation score: 0.967 (std: 0.033)
Parameters: {'min_samples_split': 12, 'max_leaf_nodes': 5,
'criterion': 'gini', 'max_depth': 19, 'min_samples_leaf': 1}

Model with rank: 2
Mean validation score: 0.967 (std: 0.033)
Parameters: {'min_samples_split': 1, 'max_leaf_nodes': 6, 'criterion':
'gini', 'max_depth': 11, 'min_samples_leaf': 1}

Model with rank: 3
Mean validation score: 0.967 (std: 0.033)
Parameters: {'min_samples_split': 4, 'max_leaf_nodes': 5, 'criterion':
'gini', 'max_depth': 15, 'min_samples_leaf': 1}

As with the grid search, this typically finds multiple parameter settings with a mean accuracy of 0.967, or 96.7%. As we did above, the parameters from the best run are:

print("\n-- Best Parameters:")
for k, v in ts_rs.items():
    print("parameters: {:<20s} setting: {}".format(k, v))

producing...

-- Best Parameters:
parameters: min_samples_split setting: 12
parameters: max_leaf_nodes setting: 5
parameters: criterion setting: gini
parameters: max_depth setting: 19
parameters: min_samples_leaf setting: 1

And, we can test the best parameters again:

# test the returned best parameters
print("\n\n-- Testing best parameters [Random]...")
dt_ts_rs = DecisionTreeClassifier(**ts_rs)
scores = cross_val_score(dt_ts_rs, X, y, cv=10)
print("mean: {:.3f} (std: {:.3f})".format(scores.mean(),
                                          scores.std()),
      end="\n\n")

producing...

-- Testing best parameters [Random]...
mean: 0.967 (std: 0.033)

To see what the Decision Tree looks like, we can generate the pseudo-code for the best random search result:

print("\n-- get_code for best parameters [Random]:")
dt_ts_rs.fit(X,y)
get_code(dt_ts_rs, features, targets)

producing...

-- get_code for best parameters [Random]:
if ( PetalLength <= 2.45000004768 ) {
 return Iris-setosa ( 50 examples )
}
else {
 if ( PetalWidth <= 1.75 ) {
  if ( PetalLength <= 4.94999980927 ) {
   if ( PetalWidth <= 1.65000009537 ) {
    return Iris-versicolor ( 47 examples )
   }
   else {
    return Iris-virginica ( 1 examples )
   }
  }
  else {
   return Iris-versicolor ( 2 examples )
   return Iris-virginica ( 4 examples )
  }
 }
 else {
  return Iris-versicolor ( 1 examples )
  return Iris-virginica ( 45 examples )
 }
}

and visualize the tree

visualize_tree(dt_ts_rs, features, fn="rand_best")

producing:

Image of the best decision tree using random search

Wrapping Up

So, we've used grid and random search with cross-validation to tune the parameters for our Decision Tree. In both cases there were marginal improvements, from 96% to 96.7%. Of course, this effect can be much larger in more complicated problems. A few final notes:

  • After finding the best parameter settings through the cross-validation search, it is typical to train the model on all of the data using those best parameters.
  • The conventional wisdom is that random search is more efficient than grid search for practical applications. This certainly makes sense when there are many parameters and features to test and a full grid search becomes computationally impractical-- it simply takes too long.
  • The basic cross-validation ideas developed here can be applied to many other scikit-learn models-- Random Forests, logistic regression, SVM, etc. To do this you simply need to spend some time learning about the model's parameters, and the range of sensible values, to set up your own grid/random search-- a rough sketch for a Random Forest follows this list, and I'd love to see it if you try this with another model!
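For example, here is a rough, untested sketch of re-using the run_gridsearch helper above with a Random Forest on the same X and y. The parameter names come from RandomForestClassifier; the particular ranges are just plausible guesses for illustration, not recommendations:

from sklearn.ensemble import RandomForestClassifier

# same grid-search helper as above, different model and parameter grid
rf_param_grid = {"n_estimators": [10, 50, 100],
                 "max_features": [1, 2, 4],
                 "min_samples_split": [2, 10, 20],
                 "max_depth": [None, 2, 5, 10]}

rf = RandomForestClassifier(random_state=99)
rf_best_params = run_gridsearch(X, y, rf, rf_param_grid, cv=10)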

So, that's it. If you have comments or questions please leave them below-- links to code, web pages, etc are also very much appreciated by me and other readers. As always, reports of corrections/typos are also welcome. Enjoy!