Treatment of Categorical Variables in H2O's DRF Algorithm
This post answers some common questions about how H2O algorithms handle categorical variables automatically.
In DRF (Distributed Random Forest), the categorical_encoding parameter is exposed; the details are explained in the H2O documentation.
Q: What is the meaning of AUTO (let the algorithm decide) in DRF?
A: According to the H2O documentation, GBM, DRF, and K-Means can use auto or AUTO, which means letting the algorithm decide (this is the default). For GBM, DRF, and K-Means, the algorithm performs enum encoding when the auto option is specified.
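For concreteness, here is a minimal Python sketch of setting this option on a DRF model. The file name, feature columns, and response below are hypothetical; categorical_encoding and the estimator itself are the real H2O API:

```python
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

# Hypothetical training data with a categorical "state" column
train = h2o.import_file("train.csv")
train["state"] = train["state"].asfactor()   # make sure the column is a factor

drf = H2ORandomForestEstimator(
    ntrees=50,
    categorical_encoding="auto",  # for DRF, "auto" resolves to enum encoding
)
drf.train(x=["state", "income"], y="response", training_frame=train)
```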
Q: Could you explain how Eigen encoding works? Do you have a good online reference?
A: With Eigen encoding, there are k columns per categorical feature, holding the projections of the one-hot-encoded matrix onto the k-dimensional eigenspace. Eigen currently uses only k = 1.
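As a rough conceptual sketch of the idea (not H2O's internal code, and the exact projection H2O computes may differ), the snippet below one-hot encodes a toy column and projects the centered one-hot matrix onto its leading eigen/singular direction, producing a single numeric column (k = 1):

```python
import numpy as np

# Toy categorical column
levels = np.array(["AL", "AK", "AZ", "AL", "AK"])
cats, idx = np.unique(levels, return_inverse=True)

one_hot = np.eye(len(cats))[idx]        # n x (number of levels) one-hot matrix

# Leading eigen/singular direction of the centered one-hot matrix
centered = one_hot - one_hot.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

eigen_feature = centered @ vt[0]        # single numeric column (k = 1)
print(eigen_feature)
```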
Q: Do you have any recommended techniques for randomizing the ordering of the categoricals? Let’s say that the categoricals are U.S. states and that large discriminative power comes from separating Alabama and Alaska, but no discrimination comes from separating {AL, AK} from the rest. With nbins_cats set to 5, isn’t it likely that the grouping of {AL, AK} versus the other states will never be selected?
A: Leave the dataset as is; internally, H2O maps the strings to integers and uses these integers to make splits, either via their ordinal nature (when nbins_cats is too small to resolve all levels) or via bitsets that do a perfect group split.
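A tiny plain-Python sketch of that internal mapping (purely illustrative; H2O's actual ordering of levels may differ):

```python
# The categorical strings stay in the frame; splitting works on integer codes.
states = ["AL", "AK", "AZ", "AR", "CA", "AK", "AL"]

level_to_int = {}
for s in states:
    level_to_int.setdefault(s, len(level_to_int))   # first-appearance order here

encoded = [level_to_int[s] for s in states]
print(level_to_int)   # {'AL': 0, 'AK': 1, 'AZ': 2, 'AR': 3, 'CA': 4}
print(encoded)        # [0, 1, 2, 3, 4, 1, 0]
```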
Q: What is meant by via bitsets that do a perfect group split? I have noticed this in the model POJO output, but I can't find this behavior documented. If the categories are letters of the English alphabet (A=1, …, Z=26), groups can be split by appropriate bags of letters (e.g. a split might send A, E, F, and X one way with other letters and NA going the other way). It can't do an exhaustive search over all possible combinations of letters A to Z to form the optimal group.
A: If nbins_cats is 5 for 26 categorical levels, there won’t be any bitsets used for splitting the categorical levels. Instead, nbins_cats (5) split points will be considered for splitting the levels, for example:
- {A … D} left vs. {E … Z} right.
- {A … M} left vs. {N … Z} right.
The five split points are uniformly spaced across all levels present in the node (at the root, that’s A … Z); those are simple “less-than” splits in the integer space of the levels.
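The toy sketch below mimics this stage with 26 integer-coded levels (A = 0 … Z = 25) and five roughly uniformly spaced “less-than” thresholds; the exact spacing rule used here is an assumption for illustration, not H2O's source:

```python
# Candidate "less-than" splits when nbins_cats cannot resolve all levels
n_levels, nbins_cats = 26, 5
letters = [chr(ord("A") + i) for i in range(n_levels)]

# Five thresholds spread roughly uniformly over the levels present in the node
thresholds = [round((i + 1) * n_levels / (nbins_cats + 1)) for i in range(nbins_cats)]

for t in thresholds:
    left, right = letters[:t], letters[t:]
    print(f"level < {t}: {{{left[0]} .. {left[-1]}}} left vs. {{{right[0]} .. {right[-1]}}} right")
```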
If one of those nbins_cats splits ends up being the best split for the given tree node (across all selected columns of the data), then the next-level split decision will have fewer levels to split, and so on. For example, the left child might contain only {A … D} (assuming the first split point above was chosen).
This smaller set of levels can now be resolved with nbins_cats = 5, and a bitset split is created that looks like this:
- A: Left.
- B: Right.
- C: Right.
- D: Left.
Yes, this is optimal (for every level, we know the training data behavior and can choose to send the points left or right), but without doing an exhaustive search.
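A toy sketch of how such a bitset split can be represented and applied at scoring time (the level set, left/right outcomes, and packing below are hypothetical, not taken from H2O's POJO output):

```python
# Once a node's levels fit within nbins_cats, a per-level bitset records
# which levels go left; scoring just tests the bit for the row's level.
levels = ["A", "B", "C", "D"]
goes_left = {"A": True, "B": False, "C": False, "D": True}   # hypothetical training outcome

bitset = 0
for i, lvl in enumerate(levels):
    if goes_left[lvl]:
        bitset |= 1 << i          # bit i set => level i goes left

def route(level):
    """Send a row left or right by testing its level's bit."""
    i = levels.index(level)
    return "left" if (bitset >> i) & 1 else "right"

for lvl in levels:
    print(lvl, "->", route(lvl))  # A -> left, B -> right, C -> right, D -> left
```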
The point here is that nbins_cats is an important tuning parameter: it leads to “perfect” splits once it’s big enough to resolve the categorical levels. Otherwise, you have to hope that the “less-than” splits provide good-enough separation to eventually get down to the perfect bitsets.
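As an illustration of that tuning (same hypothetical file and column names as the first sketch), one might compare a small nbins_cats against a value large enough to cover a high-cardinality factor such as U.S. states:

```python
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

train = h2o.import_file("train.csv")          # hypothetical data
train["state"] = train["state"].asfactor()

# Too small to resolve ~50 state levels: nodes start with coarse "less-than" splits
drf_coarse = H2ORandomForestEstimator(ntrees=50, nbins_cats=5)

# Large enough to resolve every level: nodes can use bitset ("perfect") group splits
drf_fine = H2ORandomForestEstimator(ntrees=50, nbins_cats=64)

drf_coarse.train(x=["state", "income"], y="response", training_frame=train)
drf_fine.train(x=["state", "income"], y="response", training_frame=train)
```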
Published at DZone with permission of Avkash Chauhan, DZone MVB.