Treatment of Categorical Variables in H2O's DRF Algorithm

This post answers some of the important questions related to the automated way of handling categorical variables in H2O algorithms.

In DRF (Distributed Random Forest), the categorical_encoding parameter is exposed. The explanation is here.

Q: What is the meaning of AUTO (let the algorithm decide) in DRF?

A: Based on this link, GBM, DRF, and k-means can use auto or AUTO, which lets the algorithm decide (this is the default). For GBM, DRF, and k-means, auto results in Enum encoding.
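As an illustration of what Enum encoding means in practice (a plain label encoding, one integer column per categorical feature), here is a minimal sketch in plain Python; the level ordering shown (order of first appearance) is an assumption for illustration, not necessarily H2O's internal ordering:

```python
# Illustrative sketch of Enum (label) encoding, not H2O's internal code:
# each categorical level maps to one integer, keeping a single column per
# feature. Codes here follow order of first appearance (an assumption).
def enum_encode(values):
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return encoded, codes

encoded, mapping = enum_encode(["AL", "AK", "AZ", "AL", "AK"])
print(encoded)  # [0, 1, 2, 0, 1]
```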

Q: Could you explain how Eigen encoding works? Do you have a good online reference?

A: With Eigen encoding, each categorical feature yields k columns that hold the projections of its one-hot-encoded matrix onto the k-dimensional eigenspace. Eigen currently uses only k=1.
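A rough sketch of the idea, not H2O's implementation: one-hot encode the column, then project each row onto the leading eigenvector (k=1), here found by plain power iteration on the Gram matrix:

```python
# Rough sketch of the idea (not H2O's implementation): one-hot encode the
# categorical column, then project every row onto the leading eigenvector of
# the one-hot matrix's Gram matrix, producing a single (k=1) numeric column.
def one_hot(values, levels):
    return [[1.0 if v == lvl else 0.0 for lvl in levels] for v in values]

def leading_eigenvector(rows, iters=100):
    # Power iteration on the d x d Gram matrix X^T X of the one-hot matrix.
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(gram[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

values = ["cat", "dog", "cat", "bird"]
levels = ["cat", "dog", "bird"]
X = one_hot(values, levels)
v = leading_eigenvector(X)
projected = [sum(x * e for x, e in zip(row, v)) for row in X]  # one number per row
```

Each categorical column thus collapses into one numeric column, which is what makes Eigen encoding attractive for high-cardinality features.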

Q: Do you have any recommended techniques for randomizing the ordering of the categoricals? Let’s say that the categoricals are U.S. states and that large discriminative power comes from separating Alabama and Alaska, but no discrimination comes from separating {AL, AK} from the rest. With nbins_cat set to 5, isn’t it likely that the grouping {AL, AK} versus the other states will never be selected?

A: Leave the dataset as is; internally, the strings are mapped to integers, and these integers are used to make splits, either via their ordinal order when nbins_cats is too small to resolve all levels or via bitsets that do a perfect group split.
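The two split types can be sketched as follows (a simplified illustration; the function names and the string-to-integer mapping are assumptions, not H2O code):

```python
# Simplified illustration (assumed names/mapping) of the two split types:
# an ordinal "less-than" split on the integer codes, and a bitset split that
# tests membership in an arbitrary subset of levels.
level_to_code = {s: i for i, s in enumerate(["AL", "AK", "AZ", "AR", "CA"])}

def ordinal_split(value, threshold):
    # "Less-than" split in the integer space of the levels.
    return "left" if level_to_code[value] < threshold else "right"

def bitset_split(value, left_levels):
    # Perfect group split: any subset of levels can be sent left.
    return "left" if value in left_levels else "right"

print(ordinal_split("AZ", 2))            # right (code 2 is not < 2)
print(bitset_split("AZ", {"AL", "AZ"}))  # left
```

Note that the bitset split can send {AL, AK} one way regardless of where those levels fall in the integer ordering, which is exactly what the ordinal split cannot do.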

Q: What is meant by “via bitsets that do a perfect group split”? I have noticed this in the model POJO output, but I can't find this behavior documented. If the categories are letters of the English alphabet (A=1, …, Z=26), groups can be split by appropriate bags of letters (e.g. a split might send A, E, F, and X one way, with the other letters and NA going the other way). Presumably, it can't do an exhaustive search over all possible combinations of letters A to Z to form the optimal group.

A: If nbins_cat is 5 for 26 categorical levels, there won’t be any bitsets used for splitting the categorical levels. Instead, nbins_cat (5) split points will be considered for splitting the levels into, for example:

  • {A … D} left vs. {E … Z} right.
  • {A … M} left vs. {N … Z} right.

The five split points are uniformly spaced across all levels present in the node (at the root, that’s A … Z); those are simple “less-than” splits in the integer space of the levels.
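A small sketch of how such uniformly spaced split points could be computed; the exact arithmetic H2O uses internally is an assumption here, chosen so the candidates match the examples above:

```python
# Assumed illustration of uniformly spacing nbins_cat split points across the
# levels present in a node (A..Z at the root); not H2O's exact arithmetic.
import string

levels = list(string.ascii_uppercase)  # 26 levels at the root
nbins_cat = 5
split_points = [round(len(levels) * (i + 1) / (nbins_cat + 1))
                for i in range(nbins_cat)]
# split_points == [4, 9, 13, 17, 22]: the first candidate sends {A..D} left
# and {E..Z} right; the third sends {A..M} left and {N..Z} right.
for s in split_points:
    print("".join(levels[:s]), "left vs.", "".join(levels[s:]), "right")
```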

If one of the nbins_cat splits ends up being the best split for the given tree node (across all selected columns of the data), then the next level split decision will have fewer levels to split, and so on. For example, the left node might contain only {A … D} (assuming the first split point was chosen above).

This smaller set of levels can now be resolved with nbins_cat = 5, and a bitset split is created that looks like this:

  • A: Left.
  • B: Right.
  • C: Right.
  • D: Left.

Yes, this is optimal (for every level, we know the training data behavior and can choose to send the points left or right), but without doing an exhaustive search.
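Concretely, the per-level left/right decisions above can be stored as one bit per level, which is presumably why these are called bitsets; a minimal sketch:

```python
# Assumed illustration: the left/right decision stored as one bit per level
# (A = bit 0, ..., D = bit 3), with a set bit meaning "go left".
levels = ["A", "B", "C", "D"]
bitset = 0b1001  # A and D go left; B and C go right

def route(level):
    i = levels.index(level)
    return "left" if (bitset >> i) & 1 else "right"

print([route(l) for l in levels])  # ['left', 'right', 'right', 'left']
```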

The point here is that nbins_cat is an important tuning parameter as it will lead to “perfect” splits once it’s big enough to resolve the categorical levels. Otherwise, you have to hope that the “less-than” splits will lead to good-enough separation to eventually get to the perfect bitsets.

Published at DZone with permission of
