As analysts, our job is not only to make our clients look smart, but also to make sure we are looking smartly at the data. From the inception of a project, we examine and re-examine the logic of our research plan and variables, because there are many classic misuses of statistics to watch out for. In this blog, we talk about one such misuse: data dredging.
Big data, big insights? Sure, if used correctly. However, overfitting a model by searching through an excess of variables can lead to false findings. This practice is called data dredging.
You will sometimes hear analysts talk about confidence levels, most often a 95% confidence level. Loosely speaking, this means we accept a 5% chance of declaring a finding significant when it is really just random noise. There is a flip side to this coin, however: any single test of two completely unrelated variables has roughly a 5% chance of producing a spurious (bogus) but apparently statistically significant result. This becomes a real problem with big data sets that have many variables, because the number of possible pairs grows rapidly, and with it the odds that at least some pairs will look significant by chance alone.
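You can see this effect in a quick simulation. The sketch below (plain Python; the variable names and the sample sizes are illustrative choices, not anything from a real project) generates a set of completely independent variables, tests every pair for correlation, and counts how many pairs clear an approximate 5% significance threshold anyway.

```python
import random
import math

random.seed(42)

n_obs = 500   # observations per variable
n_vars = 40   # unrelated variables -> 40 * 39 / 2 = 780 pairs to test
# Approximate 5% critical value for |r| under the null (large-sample rule of thumb)
alpha_threshold = 1.96 / math.sqrt(n_obs)

# Every variable is pure, independent noise
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

pairs = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)]
false_hits = sum(1 for i, j in pairs
                 if abs(corr(data[i], data[j])) > alpha_threshold)

print(f"{len(pairs)} pairs tested, {false_hits} look 'significant' at the 5% level")
```

Even though every variable here is random noise, roughly 5% of the pairs will appear statistically significant. The more pairs you test, the more of these bogus "findings" you will dredge up.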
In other words, if you give me enough unrelated variables I could discover a bizarre relationship such as “prospects born in September who prefer pepper jack cheese on their roast beef sub are 5% more likely to respond to your acquisition package.” Seriously… I could do that. But it probably wouldn’t be helpful, and it is most likely a false relationship.
How, then, do we ever know whether the results of a model we build are valid and whether we have chosen the right variables? One best practice is to build your model using only a random sample of your available cases, and then test the new model on the cases you initially reserved and did not use to build it. If the model performs similarly on the reserved cases, it is likely valid.
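The holdout idea above can be sketched in a few lines of plain Python. In this illustration (all names and sample sizes are hypothetical), every candidate "predictor" is pure noise; we dredge for the one that correlates best with the outcome on the training sample, then re-check that same variable on the reserved cases.

```python
import random

random.seed(0)

n_cases = 1000
n_candidates = 50   # candidate predictor variables, all pure noise

# Outcome and candidate predictors are all independent noise
outcome = [random.random() for _ in range(n_cases)]
predictors = [[random.random() for _ in range(n_cases)]
              for _ in range(n_candidates)]

# Reserve a random holdout sample BEFORE any model building
indices = list(range(n_cases))
random.shuffle(indices)
train, holdout = indices[:700], indices[700:]

def corr(xs, ys, idx):
    """Pearson correlation of xs and ys restricted to the cases in idx."""
    x = [xs[i] for i in idx]
    y = [ys[i] for i in idx]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Dredge: pick the predictor that looks best on the training sample
best = max(range(n_candidates),
           key=lambda k: abs(corr(predictors[k], outcome, train)))
train_r = corr(predictors[best], outcome, train)
holdout_r = corr(predictors[best], outcome, holdout)

print(f"best-of-{n_candidates} training correlation: {train_r:.3f}")
print(f"same variable on reserved cases:   {holdout_r:.3f}")
```

Because the "winning" predictor was selected precisely for looking good on the training cases, its correlation typically shrinks toward zero on the reserved cases, which is exactly the warning sign the holdout test is designed to give.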
If you want a super model, you have to look smartly at your data. You didn’t really think this blog was going to be about Giselle Bundchen, right?