Dear Dr. Jay,
After the 2016 election, how will I ever be able to trust predictive models again?
Alyssa
Dear Alyssa,
Data Happens!
Whether we’re talking about political polling or market research, to build good models, we need good inputs. Or as the old saying goes: “garbage in, garbage out.” Consider the sources of error in the data itself: sampling error (who you happened to reach), nonresponse error (who declined to answer), and measurement error (what people told you versus what they actually do).
Now, let’s consider the sources of error in building predictive models. The first step in building a predictive model is to specify the model. If you’re a purist, you begin with a hypothesis, collect the data, test the hypothesis, and draw conclusions. If you fail to reject the null hypothesis, you should formulate a new hypothesis and collect new data. What do we actually do? We mine the data until we get significant results. Why? Because data collection is expensive. The risk of mining the same data in search of a better model is overfitting: you end up with a model that is very good at predicting the data you already have, but not very accurate when predicting from new inputs.
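You can see overfitting with nothing more than pencil-and-paper data. The sketch below (pure Python; the five data points and the holdout input are made up for illustration) threads a degree-4 polynomial exactly through five noisy points that roughly follow y = 2x, then compares it against a plain least-squares line on a new input:

```python
# Hypothetical training data: roughly y = 2x, with a little noise baked in.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.2, 3.9, 6.1, 8.0]

def lagrange(x, xs, ys):
    """Degree-(n-1) interpolating polynomial: zero training error by construction."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Closed-form simple linear regression on the same five points.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# The flexible model "explains" the training data perfectly...
train_resid = max(abs(lagrange(x, xs, ys) - y) for x, y in zip(xs, ys))

# ...but on a new input (true value y = 7.0 at x = 3.5), the humble line wins.
x_new, y_true = 3.5, 7.0
poly_err = abs(lagrange(x_new, xs, ys) - y_true)
line_err = abs(slope * x_new + intercept - y_true)
print(f"train residual: {train_resid:.2g}, "
      f"polynomial holdout error: {poly_err:.3f}, line holdout error: {line_err:.3f}")
```

The polynomial scores a perfect fit on the data it was built from yet misses the new point by more than ten times as much as the simple line: exactly the trap of mining one dataset until something fits.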
It is up to the analyst to decide what is statistically meaningful versus what is managerially meaningful. There are a number of websites devoted to cataloguing “interesting” but entirely spurious correlations found by trawling through data.
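It’s easy to reproduce the mechanics behind those coincidences yourself: test enough unrelated series against an outcome and some of them will correlate strongly by pure chance. A minimal sketch in pure Python (the series are simulated random walks with no real relationship; the seed and counts are arbitrary):

```python
import random

random.seed(7)  # fixed seed so the run is reproducible

def random_walk(n):
    """A series with no underlying signal: a cumulative sum of coin-flip steps."""
    walk, level = [], 0.0
    for _ in range(n):
        level += random.choice([-1.0, 1.0])
        walk.append(level)
    return walk

def pearson(a, b):
    """Plain Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

target = random_walk(30)                            # the "outcome" we want to explain
candidates = [random_walk(30) for _ in range(200)]  # 200 unrelated "predictors"

best_r = max((pearson(target, c) for c in candidates), key=abs)
print(f"best |r| found by dredging 200 unrelated series: {abs(best_r):.2f}")
```

Every one of those 200 series is noise by construction, yet the best of them will look impressively “predictive.” That’s why a dredged-up correlation, however strong, still needs a sanity check against the business context.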
In short, you can build a model that’s accurate but still wouldn’t be of any use (or make any sense) to your client. And the fact is, there’s always a certain amount of error in any model we build—we could be wrong, just by chance. Ultimately, it’s up to the analyst to understand not only the tools and inputs they’re using but the business (or political) context.
[disclaimer]Dr. Jay loves designing really big, complex choice models. With over 20 years of DCM experience, he’s never met a design challenge he couldn’t solve.[/disclaimer]