Ensembles are nothing new, of course; they underlie many of the most popular machine learning algorithms (e.g., random forests and generalized boosted models). The theory is that consensus opinions from diverse modeling techniques are more reliable than potentially biased or idiosyncratic predictions from a single source. More broadly, this principle is as basic as “two heads are better than one.” It’s why cancer patients get second opinions, why the Supreme Court upheld affirmative action, why news organizations like MSNBC and Fox hire journalists with a wide variety of political leanings…

Well, maybe the principle isn’t *universally* applied. Still, it is fundamental to many disciplines and holds enormous value for the data scientist. Below, I will explain why, by addressing the following:

- **Ensembles add value:** Using ‘real world’ evidence from Kaggle and GigaOM’s recent WordPress Challenge, I’ll show how very basic ensembles – both within my own solution and, more interestingly, between my solution and the second-place finisher – added significant improvements to the result.
- **Why ensembles add value:** For the non-expert (like me), I’ll give a brief explanation of why ensembles add value and support overkill analytics’ chief goal: leveraging computing scale productively.
- **How ensembles add value:** Also for the non-expert, I’ll run through a ‘toy problem’ depiction of how a very simple ensemble works, with some nice graphs and R code if you want to follow along.

Most of this very long post may be extremely rudimentary for seasoned statisticians, but rudimentary is the name of the game at Overkill Analytics. If this is old hat for you, I’d advise just reading the first section – which has some interesting real world results – and glancing at the graphs in the final section (which are very pretty and are useful teaching aids on the power of ensemble methods).

**Ensembles Add Value: Two Data Geeks Are Better Than One**

Overkill analytics is built on the principle that ‘more is always better’. While it is exciting to consider the ramifications of this approach in the context of massive Hadoop clusters with thousands of CPUs, sometimes overkill analytics can require as little as adding an extra model when the ‘competition’ (a market rival, a contest entrant, or just the next best solution) uses only one. Even more simply, overkill can just mean leveraging results from two independent analysts or teams rather than relying on predictions from a single source.

Below is some evidence from my recent entry in the WordPress Challenge on Kaggle. (Sorry to again use this as an example, but I’m restricted from using problems from my day job.) In my entry, I used a basic ensemble of a logistic regression and a random forest model – each with essentially the same set of features/inputs – to produce the requested recommendation engine for blog content. From the chart below, you can see how this ensemble performed in the evaluation metric (mean average precision @ 100) at various ‘weight values’ for the ensemble components:

Note that either model used independently as a recommendation engine – either a random forest solution or a logistic regression solution – would have been insufficient to win the contest. However, taking an average of the two produced a (relatively) substantial margin over the nearest competitor. While this is a fairly trivial example of ensemble learning, I think it is significant evidence of the competitive advantages that can be gained from adding just a little scale.

For a more interesting example of ensemble power, I ran the same analysis with an ensemble of my recommendation engine and the entry of the second place finisher (Olexandr Topchylo). To create the ensemble, I used nothing more complicated than a ‘college football ranking’ voting scheme (officially a Borda count): assign each prediction a point value equal to its inverse rank for the user in question. I then combined the votes at a variety of different weights, re-ranked the predictions per user, and plotted the evaluation metric:

By combining the disparate information generated by my modeling approach and Dr. Topchylo’s, one achieves a predictive result that beats any individual entry by 1.2% – as large as the winning margin in the competition. Moreover, no sophisticated tuning of the ensemble is required – a 50/50 average comes extremely close to the optimal result.
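For the curious, the voting scheme can be sketched in a few lines of Python. This is a hypothetical illustration with made-up blog names, not my actual contest code:

```python
def borda_ensemble(ranking_a, ranking_b, weight_a=0.5):
    """Weighted Borda count over two ranked recommendation lists.

    Each blog earns points equal to its inverse rank in each list
    (the top of an n-item list earns n points); the scores are then
    blended by weight and the blogs re-ranked.
    """
    def points(ranking):
        n = len(ranking)
        return {blog: n - i for i, blog in enumerate(ranking)}

    pa, pb = points(ranking_a), points(ranking_b)
    blogs = set(pa) | set(pb)
    score = {b: weight_a * pa.get(b, 0) + (1 - weight_a) * pb.get(b, 0)
             for b in blogs}
    # highest combined score first; ties broken alphabetically
    return sorted(blogs, key=lambda b: (-score[b], b))

# two engines agree on the top blog but disagree below it
print(borda_ensemble(["blog1", "blog2", "blog3"],
                     ["blog1", "blog3", "blog2"]))
# → ['blog1', 'blog2', 'blog3']
```

The default `weight_a=0.5` corresponds to the simple 50/50 average discussed above; sweeping it from 0 to 1 reproduces the weight curve.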

Ensembling is a cheap and easy technique that allows one to combine the strengths of disparate data science techniques or even disparate data scientists. Ensemble learning showcases the real power of Kaggle as a platform for analytics – clients receive not only the value of the ‘best’ solution, but also the superior value of the combined result. It can even be used as a management technique for data science teams (e.g., set one group to work collaboratively on a solution, set another to work independently and combine results in an ensemble, and compare – or combine – the results of the two teams). Any way you slice it, the core truth is apparent: two data geeks are always better than one.

So what does this example tell us about how ensembles can be used to leverage large-scale computing resources to achieve faster, better, and cheaper predictive modeling solutions? Well, it shows (in a very small way) that computing scale should be used to make predictive models broader rather than deeper. By this, I mean that scale should be used first to expand the variety of features and techniques applied to a problem, not to delve into a few sophisticated and computationally intensive solutions.

The overkill approach is to abandon sophisticated modeling approaches (at least initially) in favor of more brute force techniques. However, *unconstrained* brute force in predictive modeling is a recipe for disaster. For example, one could just use extra processing scale to exhaustively search a function space for the model that best matches available data, but this would quickly fail for two reasons:

- exhaustive searches of large, unconstrained solution spaces are computationally impossible, even with today’s capacity; and
- even when feasible, searching purely on the criteria of best match will lead to overfit solutions: models which overlearn the noise in a training set and are therefore useless with new data.

At the risk of overstatement, the entire field of predictive modeling (or data science, if you prefer) exists to address these two problems – i.e., to find techniques that search a narrow solution space (or narrowly search a large solution space) for productive answers, and to find search criteria that discover real solutions rather than just learning noise.

So how can we overkill without the overfit? That’s where ensembles come in. Ensemble learning uses combined results from multiple different models to provide a potentially superior ‘consensus’ opinion. It addresses both of the problems identified above:

- Ensemble methods have broad solution spaces (essentially ‘multiplying’ the component search spaces) but search them narrowly – trying only combinations of ‘best answers’ from the components.
- Ensemble methods avoid overfitting by utilizing components that read irrelevant data differently (canceling out noise) but read relevant inputs similarly (enhancing underlying signals).
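The noise-canceling effect in the second point is easy to demonstrate numerically. In this toy sketch (illustrative only – the noise levels are made up), two ‘models’ see the same signal but carry independent, zero-mean noise; averaging them roughly halves the squared error while leaving the signal intact:

```python
import random

random.seed(1)

n = 100_000
signal = [random.random() for _ in range(n)]

# two 'models': same signal, independent zero-mean noise
pred_a = [s + random.gauss(0, 0.2) for s in signal]
pred_b = [s + random.gauss(0, 0.2) for s in signal]

# the ensemble: a simple average of the two predictors
ensemble = [(a + b) / 2 for a, b in zip(pred_a, pred_b)]

def mse(pred):
    return sum((p - s) ** 2 for p, s in zip(pred, signal)) / n

# independent noise partially cancels: the ensemble's error is
# roughly half that of either component alone
print(mse(pred_a), mse(ensemble))
```

The catch, of course, is the independence assumption – if the two components made correlated errors, the averaging would cancel nothing, which is exactly the diversity requirement discussed below.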

It is a simple but powerful idea, and it is crucial to the overkill approach because it allows the modeler to appropriately leverage scale: use processing power to create large quantities of weak predictors which, in combination, outperform more sophisticated methods.

The key to ensemble methods is selecting or designing components with independent strengths and a true diversity of opinion. If individual components add little information, or if they all draw the same conclusion from the same facts, the ensemble will add no value and may even reinforce errors from individual components. (If a good ensemble is like a panel of diverse musical faculty using different criteria to select the best students for Juilliard, a bad ensemble would be like a mob of tween girls advancing Sanjaya on American Idol.) In order to succeed, the ensemble’s components must excel at finding different types of signals in your data while having varied, uncorrelated responses to your data’s noise.

As shown in the first section of this post, a small but effective example of ensemble modeling is to take an average of classification results from a logistic regression and a random forest. These two techniques are good complements for an ensemble because they have very different strengths:

- Logistic regressions find broad relationships between independent variables and predicted classes, but without guidance they cannot find non-linear signals or interactions between multiple variables.
- Random forests (themselves an ensemble of classification trees) are good at finding these narrower signals, but they can be overconfident and overfit noisy regions in the input space.

Below is a walkthrough applying this ensemble to a toy problem – finding a non-linear classification signal from a data set containing the class result, the two relevant inputs, and two inputs of pure noise.

The signal for our walkthrough is a non-linear equation of two variables dictating the probability that a vector X belongs to a class C:

P(C | x1, x2) = 1 / (1 + 2^(x1^3 + x2 + x1·x2))

The training data uses the above signal to determine class membership for a sample of 10,000 two-dimensional X vectors (x1 and x2). The dataset also includes two irrelevant random features (x3 and x4) to make the task slightly more difficult. Below is R code to generate the training data and produce two maps showing the signal we are seeking as well as the training data’s representation of that signal:

```r
# packages
require(fields)        # for heatmap plot with legend
require(randomForest)  # requires installation, for random forest models
set.seed(20120926)

# heatmap wrapper, plotting func(x, y) over range a by a
hmap.func <- function(a, f, xlab, ylab) {
  image.plot(a, a, outer(a, a, f), zlim = c(0, 1), xlab = xlab, ylab = ylab)
}

# define class signal
g <- function(x, y) 1 / (1 + 2^(x^3 + y + x*y))

# create training data
d <- data.frame(x1 = rnorm(10000), x2 = rnorm(10000),
                x3 = rnorm(10000), x4 = rnorm(10000))
d$y <- with(d, ifelse(runif(10000) < g(x1, x2), 1, 0))

# plot signal (left-hand plot below)
a <- seq(-2, 2, len = 100)
hmap.func(a, g, "x1", "x2")

# plot training data representation (right-hand plot below)
z <- tapply(d$y, list(cut(d$x1, breaks = seq(-2, 2, len = 25)),
                      cut(d$x2, breaks = seq(-2, 2, len = 25))), mean)
image.plot(seq(-2, 2, len = 25), seq(-2, 2, len = 25), z, zlim = c(0, 1),
           xlab = "x1", ylab = "x2")
```

[Figure: the true signal in x1, x2 (left) and the training set’s representation of that signal (right)]

As you can see, the signal is somewhat non-linear and is only weakly represented by the training set data. Thus, it presents a reasonably good test for our sample ensemble.

The next batch of code and R maps show how each ensemble component, and the ensemble itself, interpret the relevant features:

```r
# fit logistic regression and random forest
fit.lr <- glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = d)
fit.rf <- randomForest(as.factor(y) ~ x1 + x2 + x3 + x4, data = d,
                       ntree = 100, proximity = FALSE)

# create functions in x1, x2 to give model predictions
# while setting x3, x4 at origin
g.lr.sig <- function(x, y)
  predict(fit.lr, data.frame(x1 = x, x2 = y, x3 = 0, x4 = 0), type = "response")
g.rf.sig <- function(x, y)
  predict(fit.rf, data.frame(x1 = x, x2 = y, x3 = 0, x4 = 0), type = "prob")[, 2]
g.en.sig <- function(x, y) 0.5*g.lr.sig(x, y) + 0.5*g.rf.sig(x, y)

# map model predictions in x1 and x2
hmap.func(a, g.lr.sig, "x1", "x2")
hmap.func(a, g.rf.sig, "x1", "x2")
hmap.func(a, g.en.sig, "x1", "x2")
```

[Figure: model predictions in x1, x2 – logistic regression (left), random forest (center), ensemble (right)]

Note how the logistic regression makes a consistent but incomplete depiction of the signal – finding the straight line that best approximates the answer. Meanwhile, the random forest captures more details, but it is inconsistent and ‘spotty’ due to its overreaction to classification noise. The ensemble marries the strengths of the two, filling in some of the ‘gaps’ in the random forest depiction with steadier results from the logistic regression.

Similarly, ensembling with a logistic regression helps wash out the random forest’s misinterpretation of irrelevant features. Below is code and resulting R maps showing the reaction of the models to the two noisy inputs, x3 and x4:

```r
# create functions in x3, x4 to give model predictions
# while setting x1, x2 at origin
g.lr.noise <- function(x, y)
  predict(fit.lr, data.frame(x1 = 0, x2 = 0, x3 = x, x4 = y), type = "response")
g.rf.noise <- function(x, y)
  predict(fit.rf, data.frame(x1 = 0, x2 = 0, x3 = x, x4 = y), type = "prob")[, 2]
g.en.noise <- function(x, y) 0.5*g.lr.noise(x, y) + 0.5*g.rf.noise(x, y)

# map model predictions in noise inputs x3 and x4
hmap.func(a, g.lr.noise, "x3", "x4")
hmap.func(a, g.rf.noise, "x3", "x4")
hmap.func(a, g.en.noise, "x3", "x4")
```

[Figure: model predictions in the noise inputs x3, x4 – logistic regression (left), random forest (center), ensemble (right)]

As you can see, the random forest reacts relatively strongly to the noise, while the logistic regression is able to correctly disregard the information. In the ensemble, the logistic regression cancels out some of the overfitting from the random forest on the irrelevant features, making them less critical to the final model result.

Finally, below is R code and a plot showing a classification error metric (cross-entropy error) on a validation data set for various ensemble weights:

```r
# (ugly) function for measuring cross-entropy error
cross.entropy <- function(target, predicted) {
  predicted <- pmax(1e-10, pmin(1 - 1e-10, predicted))
  -sum(target * log(predicted) + (1 - target) * log(1 - predicted))
}

# creation of validation data
dv <- data.frame(x1 = rnorm(10000), x2 = rnorm(10000),
                 x3 = rnorm(10000), x4 = rnorm(10000))
dv$y <- with(dv, ifelse(runif(10000) < g(x1, x2), 1, 0))

# create predicted results for each model
dv$y.lr <- predict(fit.lr, dv, type = "response")
dv$y.rf <- predict(fit.rf, dv, type = "prob")[, 2]

# ensemble cross-entropy error at weight w for the logistic regression
error.by.weight <- function(w) cross.entropy(dv$y, w*dv$y.lr + (1 - w)*dv$y.rf)

# plot + pretty
plot(Vectorize(error.by.weight), from = 0, to = 1,
     xlab = "ensemble weight on logistic regression",
     ylab = "cross-entropy error of ensemble", col = "blue")
text(0.1, error.by.weight(0) - 30, "Random\nForest\nOnly")
text(0.9, error.by.weight(1) + 30, "Logistic\nRegression\nOnly")
```

The left side of the plot is a random forest alone, and has the highest error. The right side is a logistic regression alone, and has somewhat lower error. The curve shows ensembles with varying weights for the logistic regression, most of which outperform either candidate model.

Note that the most unsophisticated of ensembles – a simple average – achieves almost all of the potential ensemble gain. This is consistent with the ‘real world’ example in the first section. Moreover, it is a true overkill analytics result – a cheap and obvious trick that gets almost all of the benefit of a more sophisticated weighting scheme. It also exemplifies my favorite rule of probability, espoused by poker writer Mike Caro: in the absence of information, everything is fifty-fifty.

Super long post, I realize, but I really appreciate anyone who got this far. I plan one more post next week on the WordPress Challenge, which will describe the full list of features I used to create the ensemble model and a brief (really) analysis of the relative importance of each.

As always, thanks for reading!

Forgive me for the self-indulgence, but here’s a quick excerpt:

> And therein lies the beauty of overkill analytics, a term that Carter might have coined, but that appears to be catching on — especially in the world of web companies and big data. Carter says he doesn’t want to spend a lot of time fine-tuning models, writing complex algorithms or pre-analyzing data to make it work for his purposes. Rather, he wants to utilize some simple models, reduce things to numbers and process the heck out of the data set on as much hardware as is possible.

I’m pleased GigaOM found the overkill analytics characterization worth discussing. It’s brought a host of visitors to the site (at least compared to the readership I thought I’d have).

With the additional visits, though, I feel I should put a little more meat on the bones about my development philosophy. I’m sure this will be familiar territory for seasoned data scientists, but below are four principles that guide my approach:

- **Spend most of your time engineering mass quantities of features.** Synthesizing raw data about the subject into pertinent, lower-dimensional metrics always brings the biggest bang for your buck. More features with more diversity are always better, even if some are simplistic or unsophisticated.
- **Spend very little of your time comparing, selecting, and fine-tuning models.** A simple ensemble of many crude models is usually better than a perfectly-calibrated ensemble of more precise models.
- **Spend no time making your algorithm elegant, optimized, or theoretically sound.** Use cheap servers and cheap tricks instead.
- **Get results first; explain them later.** The statistical algorithms available are powerful and unbiased – they will find the key elements before you do. Your job is to feed them mass quantities of features and then explain and interpret what they find, not guide them to a preconceived intuition about the answer.

To an extent, this is just a restatement of some best practices in predictive modeling. However, overkill analytics takes these principles to new and hopefully productive extremes by leveraging cheap cloud computing power and rapid development principles. It is in this respect that I hope to offer some innovation and new approaches.

Anyway, thanks for the visits. I will publish a couple more posts next week on the WordPress Challenge, one describing and ranking the full set of features I used, and one on the power of simple ensembles to improve results on this type of real-world problem.

As always, thanks for reading.

My general approach – consistent with my overkill analytics philosophy – was to abandon any notions of elegance and instead blindly throw multiple tactics at the problem. In practical terms, this means I hastily wrote ugly Python scripts to create data features, and I used oversized RAM and CPU from an Amazon EC2 spot instance to avoid any memory or performance issues from inefficient code. I then tossed all of the resulting features into a glm and a random forest, averaged the results, and hoped for the best. It wasn’t subtle, but it was effective. (Full code can be found here if interested.)

From my brief review of other winning entries, I believe one unique quality of my submission was its limited use of the WordPress social graph. (Fair warning: I may abuse the term ‘social graph,’ as it is not something I have worked with previously.) Specifically, a user ‘liking’ a blog post creates a link (or edge) between a user node and a blog node, and these links construct a graph connecting users to blogs outside their current reading list:

Defining user-blog relationships by this ‘like graph’ opens up a host of available tools and metrics used for social networking and other graph-related problems.

The simplest of these graph metrics is the concept of node distance within graphs. In this case, node distance is the smallest number of likes required to traverse between a particular user node and a particular blog node. In the diagram above, for example, User A and Blog 4 have a node distance of 3, while User C and Blog 5 have a distance of 5.

The chart below breaks down likes from the last week of the final competition training data (week 5) by the node distance between the user and the liked blog within their prior ‘like graph’ (training data weeks 1-4):

As you can see, nearly 50% of all new likes are from blogs one ‘edge’ from the user – i.e., blogs the user had already liked in the prior four weeks. These ‘like history’ blogs are a small, easily manageable population for a recommendation engine, and there are many relevant features that can be extracted based on the user’s history with the blog. Therefore, the like history set was the primary focus of most contest submissions (including mine).

However, expanding the search for candidates one more level – to a distance of 3 edges/likes traversed – encompasses 90% of all new likes. A ‘distance 3’ blog would be a blog that is not in the subject’s immediate like history but that is in the history of another user who had liked at least one blog in common with the subject. This is admittedly a large space (see below), but I think it significant that >90% of a user’s liked blogs in a given week can be found by traversing through just one common reader in the WordPress graph. Finding the key common readers and common blogs, therefore, is a promising avenue for finding recommendation candidates that are new to the subject user.
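To make the ‘distance 3’ definition concrete, here is a hypothetical sketch in Python (made-up user and blog names), with the like graph reduced to a dictionary mapping each user to the set of blogs they have liked:

```python
def distance3_blogs(likes, user):
    """Blogs three edges from `user` in the bipartite like graph: blogs
    liked by a co-reader (a user sharing at least one liked blog with
    `user`), excluding the user's own like history (distance 1)."""
    own = likes.get(user, set())
    co_readers = {u for u, blogs in likes.items() if u != user and blogs & own}
    reachable = set().union(*(likes[u] for u in co_readers)) if co_readers else set()
    return reachable - own

likes = {
    "userA": {"blog1", "blog2"},
    "userB": {"blog2", "blog3"},   # shares blog2 with userA
    "userC": {"blog4"},            # nothing in common with userA
}
print(distance3_blogs(likes, "userA"))  # → {'blog3'}
```

On the real data, of course, `co_readers` and `reachable` are exactly the sets that blow up into the tens of thousands, which is the problem the centrality metric below is meant to tame.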

As referenced above, the main problem with using the distance 3 blogs as recommendation candidates is that the search space is far too large – most users have tens of thousands of blogs in their distance 3 sets:

As seen from the above chart, while a significant portion of users (~20%) have a manageable distance 3 blog set (1,000 to 2,000 blogs), the vast majority have tens of thousands of blogs within that distance. (Some post-competition inspection shows that many of these larger networks are caused by a few ‘hyper-active’ users in the distance 3 paths. Eliminating these outliers could be a reasonable way to create a more compact distance 3 search set.)

One could just ignore the volume issues and run thousands of distance 3 blog candidates per user through the recommendation engine. However, calculating the features and training the models for this many candidate blogs would be computationally intractable (even given my inclination for overkill). To get a manageable search space, one needs to find a basic, easily calculable feature that identifies the most probable liked blogs in the set.

The metric I used was one designed to represent node centrality, a measure of how important a node is within a social graph. There are many sophisticated, theoretically sound ways to measure node centrality, but implementing them would have required minutes of exhaustive wikipedia reading. Instead, I applied a highly simplified calculation designed to measure a blog node’s three-step centrality within a specific user’s social graph:

- Step 1(a): Calculate all three-step paths from the subject user in the graph (counting multiple likes between a user and blog and multiple possible paths);
- Step 1(b): Aggregate the paths by the end-point blog; and
- Step 1(c): Divide the aggregated paths by the total paths in step 1(a).
- Step 2: ???
- Step 3: Profit.

The metric is equivalent to the probability of reaching a blog in three steps from the subject user, assuming that each outbound like/edge has an equal probability of being followed. It is akin to Google’s original PageRank, except only the starting user node receives an initial ‘score’ and only three steps are allowed when traversing the graph.
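A brute-force sketch of steps 1(a)–(c) might look like the following (hypothetical data and names, not my actual contest code, which had to cope with far larger graphs):

```python
from collections import Counter

def three_step_centrality(edges, start):
    """Fraction of all three-step like-paths from `start` ending at each
    node; repeated neighbors count as multiple likes/paths."""
    paths = Counter()
    for a in edges.get(start, []):      # step 1(a): enumerate every
        for b in edges.get(a, []):      # three-step path from the user
            for c in edges.get(b, []):
                paths[c] += 1           # step 1(b): aggregate by end point
    total = sum(paths.values())
    # step 1(c): divide by the total number of paths
    return {node: k / total for node, k in paths.items()} if total else {}

# bipartite like graph: users link to blogs and vice versa
edges = {
    "userA": ["blog1", "blog2"],
    "blog1": ["userA"],
    "blog2": ["userA", "userB"],
    "userB": ["blog2", "blog3"],
    "blog3": ["userB"],
}
print(three_step_centrality(edges, "userA"))
# blog2 ends 3 of the 6 three-step paths (0.5); blog3, reachable only
# through userB, ends 1 of 6
```

Note the fraction-of-paths form only matches the random-walk probability exactly when node out-degrees are uniform; as the post says, it is a first cut, not a theoretically sound centrality measure.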

I don’t know if this is correct or theoretically sound, but it worked reasonably well for me – substantially lifting the number of likes found when selecting candidates from the distance 3 graph:

As shown above, if you examine the first 500 distance 3 blogs by this node centrality metric, you can find over 20% of all the likes in the distance 3 blog set. If you selected 500 candidates by random sample, however, only 3% of the likes from this population would be found. While I am certain this metric could be improved greatly by using more sophisticated centrality calculations, the algorithm above serves as a very useful first cut.

I’d feel remiss not putting any code in this post. Unfortunately, there was a lot of bulky data handling code I used in this competition to get to the point where I could run the analysis above, so posting the code that produced this data would require a lot of extra files. I’d happily send it all to anyone interested, of course, just e-mail me.

However, in the interest of providing something, below is a quick node distance algorithm in Python that I used after the fact to calculate node distances in the like graph. This is just a basic breadth-first search implemented in Python, with the input graph represented as a dictionary with node names as keys and sets of connected node names as values:

```python
def distance(graph, start, end):
    # return codes for cases where either the start point
    # or end point are not in the graph at all
    if start not in graph:
        return -2
    if end not in graph:
        return -3

    # set up a marked dictionary to identify nodes already searched
    marked = dict((k, False) for k in graph)
    marked[start] = True

    # set up a FIFO queue (just a python list) of (node, distance) tuples
    queue = [(start, 0)]

    # as long as the queue is full...
    while len(queue):
        node = queue.pop(0)
        # if the next candidate is a match, return the candidate's distance
        if node[0] == end:
            return node[1]
        # otherwise, add all the nodes connected to the candidate if not
        # already searched; mark them as searched (added to queue) and
        # associate them with the candidate's distance + 1
        else:
            nextnodes = [nn for nn in graph.get(node[0], set())
                         if marked[nn] == False]
            queue.extend((nn, node[1] + 1) for nn in nextnodes)
            marked.update(dict((nn, True) for nn in nextnodes))

    # if you fall through, return a code to show no connection
    return -1
```

In the end, this node centrality calculation served as a feature in my recommendation engine’s ranking and – more importantly – as a method of identifying the candidates to be fed into the engine. I have not done the work to see how much this feature and selection method added to my score, but I know as a feature it added 2%-3% to my validation score, a larger jump than many other features. Moreover, my brief review of the other winners’ code leads me to think this may have been a unique aspect of my entry – many of the other features I used were closely correlated with elements in the others’ code.

More crucially, for actual implementation by Automattic, the ‘like graph’ is a more practical avenue for a recommendation engine, and is probably what the company uses to recommend new posts. Most of the work we did in the competition differentiated among posts from blogs the user was already reading – useful, but not a huge value-add to the WordPress user. Finding *unseen* blog posts in which a user may have interest would be a more relevant and valuable tool, and finding new blogs for the user with high centrality in their social graph is a reasonable way to find them. From my work on the competition, I believe these methods would be more promising avenues than NLP and topic-oriented methods.

The above post covers all of my thinking in applying network concepts to the WordPress Challenge problem, but I am certain I only scratched the surface. There are a host of other metrics that make sense to apply (such as eccentricity, closeness, and betweenness). If you work for WordPress/Automattic (and that is the only conceivable reason you made it this far), I’d be happy to discuss additional ideas, either on this blog or in person.

If you look around, you’ll see pretty quickly this is my first blog. There are lots of WordPress default settings and content, and everything looks pretty bland. While I love learning new technologies, I’m honestly not that into web design – my style of choice is still white (or better green) text on a black terminal. So please, bear with me as I learn to make this site at least marginally professional.

For now, please read the about this site and about me page, and visit back soon for actual content. Thanks!
