The behavior you are observing with the binary input features comes from how XGBoost represents split conditions during tree construction.
Every split in an XGBoost tree, including one on a binary feature, is a numeric comparison: the node tests whether the feature value is less than a stored threshold, exactly as it would for a continuous feature. Feature values and thresholds are held internally as 32-bit floats, and the stored threshold is simply whatever floating-point value the training procedure settled on; it does not have to be 0, 1, or anything in between.
That is why a condition like f60150 < -9.53674e-07 can appear when you visualize the individual trees, even though the raw feature only ever takes the values 0 and 1.
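If you want to see the threshold values the booster actually stored, rather than just the labels plot_tree draws, you can dump the nodes as a table. A minimal, self-contained sketch (the toy 0/1 data here is made up purely for illustration, not taken from your post):
import numpy as np
from xgboost import XGBClassifier
# Made-up 0/1 data, purely for illustration
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1],
              [1, 0], [0, 0], [0, 0], [1, 1], [0, 1]], dtype=float)
y = np.array([1, 0, 0, 0, 1, 1, 1, 0, 1, 1])
model = XGBClassifier(random_state=100).fit(X, y)
# Each row of the dataframe is one tree node; 'Split' holds the float
# threshold that plot_tree renders, and 'Yes'/'No'/'Missing' give the IDs of
# the child nodes reached when the comparison is true, false, or the value
# is missing.
nodes = model.get_booster().trees_to_dataframe()
print(nodes[['Tree', 'Node', 'Feature', 'Split', 'Yes', 'No', 'Missing']])
The dataframe makes it explicit that the threshold is stored as an ordinary float and that every split node carries its own default direction for missing values.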
In practice, during prediction a row follows the "yes" branch of a node when its value satisfies value < threshold, the "no" branch when it does not, and the "missing" branch when the value is absent. With a threshold like -9.53674e-07, neither 0 nor 1 satisfies the comparison, so rows carrying an explicit value for the feature go down the "no" branch; and because the input here is a scipy.sparse.csr_matrix, the zero entries are not stored at all and are treated as missing, which sends them down the "missing" arrow instead. The specific threshold is chosen to split the training data well, not to mirror the raw 0/1 values.
Therefore, when interpreting the binary input features in XGBoost trees, read each node as a value < threshold comparison plus a default direction for missing values; the threshold itself can look odd in isolation without changing how the rows are actually routed.
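If you want to confirm empirically which arrow a stored 1 and an unstored 0 follow, you can ask the booster which leaf each row lands in via pred_leaf. Another minimal sketch with made-up data (the single-feature sparse setup below is just for illustration):
import numpy as np
import scipy.sparse
import xgboost as xgb
# Made-up data with one binary feature; in a csr_matrix the zero entries are
# simply not stored, and XGBoost treats unstored entries as missing.
X = scipy.sparse.csr_matrix(np.array([[1.0]] * 5 + [[0.0]] * 5))
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
booster = xgb.train({'objective': 'binary:logistic', 'seed': 100},
                    xgb.DMatrix(X, label=y), num_boost_round=1)
# pred_leaf=True returns, for each row and each tree, the index of the leaf
# the row ends up in, so you can compare where an explicit 1 and an implicit
# 0 (i.e. a missing entry) are routed.
probe = scipy.sparse.csr_matrix(np.array([[1.0], [0.0]]))
print(booster.predict(xgb.DMatrix(probe), pred_leaf=True))
Comparing the leaf index for a row where the feature is 1 against one where it is absent shows directly how the two cases travel through the tree; the same check works on a couple of rows taken from your real csr_matrix.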
xgboost.plot_tree: binary feature interpretation
I've built an XGBoost model and seek to examine the individual estimators. For reference, this was a binary classification task with discrete and continuous input features. The input feature matrix is a scipy.sparse.csr_matrix.
When I went to examine an individual estimator, however, I had difficulty interpreting the binary input features, such as f60150 below. The real-valued f60150 in the bottommost chart is easy to interpret - its criterion is in the expected range of that feature. However, the comparison being made for the binary features, <X> < -9.53674e-07, doesn't make sense. Each of these features is either 1 or 0. -9.53674e-07 is a very small negative number, and I imagine this is just some floating-point idiosyncrasy within XGBoost or its underlying plotting libraries, but it doesn't make sense to use that comparison when the feature is never negative. Can someone help me understand which direction (i.e. yes, missing vs. no) corresponds to which true/false side of these binary feature nodes?
Here is a reproducible example:
import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt
def booleanize_csr_matrix(mat):
    ''' Convert sparse matrix with positive integer elements to 1s '''
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result
### Setup dataset
res = fetch_20newsgroups()
text = res.data
outcome = res.target
### Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)
# Whether to "booleanize" the input matrix
booleanize = True
# Whether to, after "booleanizing", convert the data type to match what's returned by `vec.fit_transform(text)`
to_int = True
if booleanize and to_int:
    X = booleanize_csr_matrix(X)
    X = X.astype(np.int64)
# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)
# Random state ensures we will be able to compare trees and their features consistently
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()
Heck, even if I do a really simple example, I get the "right" results, regardless of whether X or y are integer or floating types.
X = np.matrix(
    [
        [1,0],
        [1,0],
        [0,1],
        [0,1],
        [1,1],
        [1,0],
        [0,0],
        [0,0],
        [1,1],
        [0,1]
    ]
)
y = np.array([1,0,0,0,1,1,1,0,1,1])
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()