January 6, 2009

The "Naploeon Dynamite Problem"

In pondering my return to active posting on this blog, I came back to this article from late November concerning the Netflix challenge. Josh wrote a bit about this competition some months back—basically Netflix has created an “open source competition” to see if someone can improve upon the accuracy of their movie recommendation algorithm. When you select one title, Netflix suggests others—and they want to increase the likelihood that you will enjoy those suggestions, based upon your pre-existing selections and tastes.

The competition has become an intense “hobby” for many interested in data mining and analytics (Josh downloaded the data set to work on it as well), and the sharing of results has surfaced an issue contestants are calling the “Napoleon Dynamite Problem.” Basically, Napoleon Dynamite is a movie that nearly everyone who reviews it either loves or hates, and while those polarized ratings should carry strong predictive power, there is little discernible pattern to who will love it and who will hate it. One of the strongest potential predictors in the data set displays an almost random distribution. In other words, this powerful predictor appears to be an outlier.

How should a contestant proceed? As a very popular movie that elicits strong responses (love or hate, not just like or dislike), Napoleon Dynamite is a significant point in the Netflix data landscape. However, the lack of any pattern among those who rate it similarly has rendered contestants' models fuzzy, or worse.
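A toy illustration of why this hurts a model (the ratings below are invented, not drawn from the Netflix data): for a love-it-or-hate-it title with no discernible pattern, even the best constant guess (the mean rating) leaves a large error, while a consensus title scores well under the same RMSE measure the contest uses.

    # Toy illustration with invented ratings: a polarizing title versus a
    # consensus title, scored with RMSE (the Netflix Prize error measure).

    def rmse(actual, predicted):
        n = len(actual)
        return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

    polarizing = [1, 5, 1, 5, 5, 1, 5, 1]   # love-it-or-hate-it ratings
    consensus = [4, 4, 3, 4, 4, 3, 4, 4]    # broadly liked title

    for name, ratings in [("polarizing", polarizing), ("consensus", consensus)]:
        mean = sum(ratings) / len(ratings)
        error = rmse(ratings, [mean] * len(ratings))
        print(f"{name}: best constant guess {mean:.2f}, RMSE {error:.2f}")

Unless a model finds some attribute that separates the lovers from the haters, its error on the polarizing title stays stuck near two full stars, while the consensus title sits under half a star.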

This brought me back to issues I encounter almost daily in my own analytics work: how to deal with outliers. Whether it is building a predictive model or creating simple algorithmic projections of future giving, there always seems to be a dialogue between me and my clients regarding what should be included or excluded.

Consider Example 1:

Total Giving
FY04 $14,000,000
FY05 $16,000,000
FY06 $15,500,000
FY07 $15,800,000
FY08 $26,500,000


This demonstrates a common issue in fundraising: how do you account for large gifts in projections (note the dramatic increase in FY08)? If this were a realized planned gift, or possibly even a major gift, some would argue for excluding it so that it does not skew future projections. The gift was made, though, right? Is FY08 giving sustainable? How accurate can projections of future giving be if you exclude historical realized giving?
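As a quick sketch of the two approaches (the straight-line trend fit is my own illustrative assumption, not a prescribed method), here is what an FY09 projection looks like with and without the FY08 spike, using the figures above:

    # Sketch: project FY09 total giving with and without the FY08 outlier year.
    # The least-squares trend line is an illustrative assumption.

    giving = [
        (2004, 14_000_000),
        (2005, 16_000_000),
        (2006, 15_500_000),
        (2007, 15_800_000),
        (2008, 26_500_000),  # fiscal year containing the large realized gift
    ]

    def project(data, target_year):
        """Least-squares line through (year, total) pairs, evaluated at target_year."""
        n = len(data)
        x_mean = sum(x for x, _ in data) / n
        y_mean = sum(y for _, y in data) / n
        slope = sum((x - x_mean) * (y - y_mean) for x, y in data) / \
                sum((x - x_mean) ** 2 for x, _ in data)
        return y_mean + slope * (target_year - x_mean)

    print(f"FY09 projection, FY08 included: ${project(giving, 2009):,.0f}")
    print(f"FY09 projection, FY08 excluded: ${project(giving[:-1], 2009):,.0f}")

With these figures the two answers differ by roughly $8 million (about $25 million versus about $17 million), which is exactly the gap the include-or-exclude conversation is arguing over.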

For Example 2, lets consider building a predictive model where you may run into issues with deceased records, especially in relatively “younger” institutions. You can produce a model on living records (they are the only constituents that can still give major gifts!), but what if half or more of the major gifts at an institution came from records flagged as deceased? Is it necessary to lose roughly 50% of your sample? Is your model inaccurately skewed for not considering donors, many of whom have a data-rich profile, who made major gifts when they were alive, but have since passed? Does inclusion of deceased records produce “generational” predictive phenomenon with only minor relevance to today’s living donor pool?
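To make the trade-off concrete, here is a minimal sketch; the file name, column names, and the $25,000 major-gift threshold are all hypothetical, not fields from any real system.

    import pandas as pd

    # Hypothetical constituent file; column names and the major-gift threshold
    # are assumptions for illustration only.
    constituents = pd.read_csv("constituents.csv")

    major = constituents["lifetime_giving"] >= 25_000
    deceased = constituents["deceased_flag"] == "Y"

    print("Major-gift donors, total:   ", int(major.sum()))
    print("Major-gift donors, deceased:", int((major & deceased).sum()))

    # Option A: model on living records only (the only constituents who can
    # still give), at the cost of discarding much of the historical behavior.
    living_sample = constituents[~deceased]

    # Option B: keep deceased donors so their data-rich profiles inform the
    # model, accepting the risk of "generational" patterns with little
    # relevance to today's living donor pool.
    full_sample = constituents

Neither option is automatically right; the point is to know which half of the question you are answering before you fit the model.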

It is difficult to produce “rules” for outlier issues like these; many times, decisions on how to approach them depend on the specific institution or project goals. Consider, though, the “Napoleon Dynamites” in your own work, continue to experiment with ideas, and challenge your own work by finding new ways to use the data at your fingertips to answer your own questions.

If You Liked This, You’re Sure to Love That
By CLIVE THOMPSON
Published: November 21, 2008


THE “NAPOLEON DYNAMITE” problem is driving Len Bertoni crazy. Bertoni is a 51-year-old “semiretired” computer scientist who lives an hour outside Pittsburgh. In the spring of 2007, his sister-in-law e-mailed him an intriguing bit of news: Netflix, the Web-based DVD-rental company, was holding a contest to try to improve Cinematch, its “recommendation engine.” The prize: $1 million.

Read More
