
Learning with Rule Ensembles

At GA-CCRi we are interested in all kinds of learning algorithms, from generalized linear models to random forests. One of the models that we have recently been working on, called a rule ensemble, aims to provide deeper insight into how a model produces a prediction while still maintaining predictive accuracy. As discussed by Breiman [1], many of the most accurate machine learning models are those that try to emulate a natural process like the brain (neural networks, random forests) and are often a kind of black box to those who run them.

In our learning models, we think it is important to know which properties of the input variables matter when classifying an observation. The measure of how much a particular facet of a piece of data contributes to how a model makes a decision on that piece of data is what we call predictor or feature importance. For example, when predicting whether a patient with hepatitis will survive, there are many potentially predictive features: gender, age, properties of the liver, or various blood protein levels. If it turns out that, across all ages and genders, the patients who died all had firm livers while those who survived had softer livers, we as humans might conclude that having a firm liver is an important feature. Keeping this in mind, we can create machine learning models that can tell us, in a way similar to how a human could, which inputs seem to influence the output class.

In most implementations of random forests, including GA-CCRi’s online version, it is possible to get a number rating how important a feature is when deciding how to classify an observation. Logistic regression also gives a simple weighting for each feature that is roughly analogous to a random forest feature importance score. A higher number for a feature means the feature is more important, but this isn’t always enough. This number doesn’t capture which values of a feature actually matter: if the values of a particular feature cover a wide range but only a very small part of that range matters, it is hard for a human observer to pick those values out just by looking at the raw data. Another issue arises when there are interaction effects between features. Perhaps only at older ages do symptoms of fatigue influence a patient’s survival rate; the two features separately may not have the highest importance, but they are highly predictive when considered together.
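To make these single-number scores concrete, here is a minimal scikit-learn sketch on synthetic data showing the two kinds of weights described above. It is purely illustrative and is not GA-CCRi’s online random forest.

# Minimal illustration of per-feature importance scores on synthetic data.
# This is a generic scikit-learn sketch, not GA-CCRi's online random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))                  # 200 observations, 5 features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)  # only features 0 and 3 matter

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("forest importances:", forest.feature_importances_)

logit = LogisticRegression().fit(X, y)
print("logistic weights:  ", logit.coef_[0])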

To alleviate these issues in our models, we construct a series of rules that can capture many of these interactions in a human-understandable way. To produce these rules quickly, we take advantage of decision trees and build a random forest. By traversing each tree from the root node to one of its descendant nodes, we extract our rules. Each traversal becomes a conjunction of simple boolean properties that can be evaluated against any observation in our data set. For example, for a data point x whose i-th feature is x(i), a property might be x(i) greater than 100, and a rule could be

x(2) > 100 && x(7) contained in {“small”, “medium”}

To make sure we are creating rules that cover all aspects of the data, we create many trees and take advantage of the randomness in the random forest. Keeping the trees relatively short also keeps the rules readable.
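The sketch below shows one way to read such rules off a small forest of shallow trees. It uses scikit-learn’s tree internals and synthetic data, so it illustrates the idea rather than our actual extractor: every path from the root to a descendant node becomes a conjunction of threshold tests.

# Illustrative rule extraction: every root-to-descendant path in each shallow
# tree becomes a conjunction of simple threshold tests. A sketch of the idea,
# not our production extractor.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_rules(tree, node=0, conditions=()):
    # Yield each root-to-node path as a tuple of (feature, op, threshold) tests.
    if conditions:                      # every non-root node contributes a rule
        yield conditions
    if tree.children_left[node] == -1:  # leaf node, no further splits
        return
    feat, thresh = tree.feature[node], tree.threshold[node]
    yield from extract_rules(tree, tree.children_left[node],
                             conditions + ((feat, "<=", thresh),))
    yield from extract_rules(tree, tree.children_right[node],
                             conditions + ((feat, ">", thresh),))

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)

# Many short trees keep the rules readable while still covering the data.
forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0).fit(X, y)
rules = [rule for est in forest.estimators_ for rule in extract_rules(est.tree_)]
print(len(rules), "rules, for example:", rules[0])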

Once we have extracted all of our rules (sometimes over 5000!), we convert the original set of features for each observation into a new one by evaluating every rule on that observation. This gives us a feature space with a dimensionality that is often much larger than the number of observations in our data set. To make predictions while avoiding massive over-fitting, we run our converted observations through a regularized logistic regression. When the right regularization parameter is used, the weights for all but about 10% of our rules go to zero, meaning those rules have no impact on the final prediction.
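Continuing the sketch above (and reusing its X, y, and rules), the rule-feature conversion and the sparsity-inducing regression look roughly like the following. The L1 penalty and the regularization strength C shown here are illustrative choices, not the exact setup or tuned value we use.

# Convert each observation into 0/1 rule features, then fit an L1-regularized
# logistic regression so that most rule weights are driven to zero.
# Reuses X, y, and rules from the extraction sketch above; C=0.1 is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def rule_matrix(X, rules):
    # One boolean column per rule: does the observation satisfy every test?
    cols = []
    for rule in rules:
        satisfied = np.ones(len(X), dtype=bool)
        for feat, op, thresh in rule:
            test = X[:, feat] <= thresh if op == "<=" else X[:, feat] > thresh
            satisfied &= test
        cols.append(satisfied)
    return np.column_stack(cols).astype(float)

R = rule_matrix(X, rules)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(R, y)
kept = int(np.sum(clf.coef_[0] != 0))
print(kept, "of", R.shape[1], "rules kept a nonzero weight")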

When we run our rule ensemble on the same hepatitis data discussed by Breiman, we get some promising results. Averaging over 100 trials, withholding a randomized 10% of the data for testing and using the rest for training, we see an F-score of 0.897. For every trial we first create a random forest and then extract several thousand rules as outlined above. These are fed into a logistic regression trained with stochastic gradient descent, and we end up with a reduced set of rules, as seen below.

Sample Rules for Hepatitis Model

  • BILIRUBIN <= 3.5 && ALBUMIN > 2.8
  • ASCITES not contained in Set(no) && BILIRUBIN <= 1.6
  • ALK PHOSPHATE <= 256.0 && BILIRUBIN > 0.5 && ALBUMIN > 2.8
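For completeness, the trial procedure described above can be sketched roughly as follows. It is simplified to use scikit-learn’s splitter and F-score with a plain forest standing in for the full rule ensemble pipeline, and the hepatitis data itself is not included here.

# Simplified sketch of the evaluation described above: repeated random 10%
# holdouts scored with the F-measure. A plain forest stands in for the full
# rule ensemble pipeline, and the data is supplied by the caller.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def average_f_score(X, y, n_trials=100, test_size=0.1):
    scores = []
    for trial in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=trial)
        model = RandomForestClassifier(n_estimators=100, random_state=trial)
        y_hat = model.fit(X_tr, y_tr).predict(X_te)
        scores.append(f1_score(y_te, y_hat))
    return float(np.mean(scores))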

We are, however, sacrificing some accuracy for this improved interpretability. Predicting with just a random forest gives an F-score of 0.915. It is promising to note that the rule ensemble creates rules containing many of the same features that random forests rank as important but that a standard logistic regression fails to identify.

In a future post we hope to discuss how we can use rule ensembles to predict on relational data similar to what was discussed previously by Nick and Tim. For easy viewing and exploration we have begun to implement a web-app we are calling Theseus. Soon Theseus will be able to answer questions from easily pluggable datasets of graph data, and right now we are able to use Theseus to display our hepatitis predictions.

Further Reading

[1] Breiman, Leo. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science 16 (2001), no. 3, 199–231. doi:10.1214/ss/1009213726. http://projecteuclid.org/euclid.ss/1009213726.

[2] Friedman, Jerome H.; Popescu, Bogdan E. Predictive learning via rule ensembles. The Annals of Applied Statistics 2 (2008), no. 3, 916–954. doi:10.1214/07-AOAS148. http://projecteuclid.org/euclid.aoas/1223908046.
