
Data Analysis 101: Data analysis pitfalls to watch for

The mere fact that we have already published four articles about data analysis should remind us that data analysis is serious business: the decisions made with it often have far-reaching consequences.

In this article we list common pitfalls to watch for when doing data analysis, pitfalls that quietly degrade our decision-making. Now that it is possible to conduct sophisticated analysis without learning the gritty details of the underlying algorithms and methods, we need to focus all the more on applying them correctly.

Let us now start learning how to avoid these pitfalls!

Confirmation bias

Confirmation bias. Image source: fs.blog


Confirmation bias is one of the most common biases humans have; everyone falls prey to it from time to time. Fs.blog defines confirmation bias as our tendency to cherry-pick information that confirms our existing beliefs or ideas. While it is a very human trait, confirmation bias clouds our judgment and distorts our perception of reality.

Towards Data Science lists three ways confirmation bias sneaks into the process of data analysis:

  • Someone wants to use analytics to support their agenda or aim
  • The analysis is re-done until it eventually yields a palatable conclusion
  • The analysis provided by the analyst is recast in a way that looks like it supports a desired conclusion

Confirmation bias does not only sneak into the application of data analysis; it can influence every stage of the process:

  • The process of defining the problem can be fraught with all kinds of biases. It is tempting to frame the problem such that the only conclusion possible suits one’s agenda or hypothesis. 
  • The most common manifestation of confirmation bias involves selecting data that suits one’s agenda or hypothesis. There is a thin line between selecting the relevant data and cherry-picking, so be especially careful during data collection and cleaning. 
  • As highlighted above, you can actually massage the data or beat it enough to give you the result that you want. There is a reason Mark Twain said “Facts are stubborn things, but statistics are pliable.”
  • Even if you have carefully framed your problem, meticulously collected and cleaned the data, and chosen the best analysis methods, your interpretation of the data is still subject to bias. In fact, another common manifestation of confirmation bias is two people looking at the same set of data and drawing different conclusions.

How can we keep confirmation bias from tainting our data analysis and decision-making? Here are some tips, which are equally applicable to everything you do for your business:

  1. Ask neutral questions. This involves not only framing the problem (we need to list the questions that must be answered in order to solve it) but also constructing surveys if we need to gather new data through market research. To minimize bias, enlist third-party checkers to vet both. 
  2. Play devil’s advocate. It is dangerous for a company if nobody ever raises a question or two that makes the team stop and think about why they are taking a certain action in the first place. Done correctly, playing devil’s advocate ensures that all possible objections have been addressed before you act.
  3. Listen to gut feelings. Gut feelings often kick in when you are uncomfortable with something you are about to do. Listen to them: pause and think again before acting.

Taking these into account is a good start in combating confirmation bias in our data analysis.

Survivorship bias



You have probably read the story of the WW2 bombers that often circulates on social media as an example of thinking outside the box. Here is my own version:

During WW2, the US military was losing a lot of bombers to enemy fire. A statistician (Abraham Wald, in the usual telling) was asked to look at the data and recommend where additional armor should go to keep the planes from being shot down by German anti-aircraft guns. The military wanted to reinforce the sections that had sustained heavy gunfire. The statistician argued otherwise: the armor should go on the engines. Because the returning planes showed little to no damage around the engines, the planes that were hit in the engines were the ones actually shot down, and so they never appeared in the data. 

This is a classic example of survivorship bias, which is very common. The Data School presents two ways survivorship bias manifests itself in data analysis:

  • Inferring a norm: assuming that the things that survived a process are the only things that ever existed. The WW2 bomber story above is an excellent example of this.
  • Inferring causality: assuming that anything that survived a process was impacted by that process.

The Decision Lab lists two ways of preventing survivorship bias:

  • Vet your data sources: it’s possible there is missing data that would affect your data analysis and decision-making. As we have outlined in our previous article, we should exhaust all accessible sources of data first.
  • Ask yourself what you don’t see: What data is missing? Consider everything else that didn’t make it. Do you need to gather data about them? Or is their absence enough to tell us what actually happened? This question is especially important when doing diagnostic analysis.

Combating survivorship bias essentially means asking what is missing from the picture. Once the missing pieces are accounted for in your analysis, the bias is minimized or eliminated.
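
To make the distortion concrete, here is a minimal simulation of what happens when we infer a norm from survivors only; the cohort, scores, and survival rule below are entirely made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical cohort of 1,000 ventures: each has an underlying
# "quality" score, but only the stronger ones survive long enough
# to appear in the dataset we get to analyze.
quality = rng.normal(loc=50, scale=15, size=1000)
survived = quality + rng.normal(0, 10, size=1000) > 55

print(f"true average quality (whole cohort):  {quality.mean():.1f}")
print(f"observed average (survivors only):    {quality[survived].mean():.1f}")
# The survivors-only average is noticeably inflated, so any "norm"
# inferred from it overstates how well a typical venture does.
```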

Focusing too much on data and metrics

In the process of data analysis, it is easy to get lost in the numbers. It gets especially bad when you “catch” so-called metric fixation, which Jerry Z. Muller defines in The Tyranny of Metrics:

“The key components of metric fixation are the belief that it is possible–and desirable–to replace professional judgment (acquired through personal experience and talent) with numerical indicators of comparative performance based upon standardized data (metrics); and that the best way to motivate people within these organizations is by attaching rewards and penalties to their measured performance.”

As business data analysis is designed to summarize data into a set of metrics for interpretation, we need to make sure that we do not “catch” metric fixation. Stacey Barr lists three beliefs that drive metric fixation:

  1. Metrics tell the whole truth.
  2. Metrics always tell the truth.
  3. Metrics motivate people.

When we construct our problem statement, it includes a set of questions that may require calculating certain metrics to support our decisions. These metrics, however, are not enough to give you the whole picture. There are many details the metrics do not capture, and those details are what pull decision-making away from metric fixation and toward a holistic approach.

To keep yourself from fixating on the data and metrics, look at the bigger picture. Remember: data analysis and metrics are tools for decision-making, not substitutes for it. As we have seen in the survivorship bias example, a good look at the context leads decision-makers to the right actions. 

Dredging the data for patterns

You can get stuck when you spend too much time analyzing the data. Image source


Humans find patterns everywhere, and often that is useful. Consider this example used by Psych Central:

  • False positive: You hear a loud noise in the bushes. You assume it is a predator and run away. It was not a predator, but a powerful wind gust. Your cost for being incorrect is a little extra energy expenditure for your false assumption.
  • False negative: You hear a loud noise in the bushes and you assume it is the wind. It is a hungry predator. Your cost for being wrong is your life.

This is a simple example of why pattern-finding is important to humans. 

However, we often find patterns where there are none. Seeing dogs and cats in the shapes of clouds? Check. (I am guilty of this one; it’s fun!) Expecting luck tonight because your constellation is up and you are a Gemini? Check. Hearing satanic verses when a famous song is played in reverse? Check, check, and check.

The machines we use today can also find patterns in a sea of data, and their pattern-recognition capabilities keep improving. Some believe AI has already surpassed us at pattern-finding.

However, just like humans, machines can find patterns that do not exist in reality, and it is fairly common for people to take advantage of this to justify their actions. As mentioned in the section on confirmation bias, we can make the data yield a favorable result. Similarly, an unrelated set of data can be “dredged” until a pattern appears: you apply one analysis or algorithm after another until something turns up. This is called data dredging. The pattern it produces, however, does not exist in reality. 
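
Here is a minimal sketch of data dredging in action, using numpy and scipy on pure random noise; the 100 candidate “driver” metrics and the 5% significance level are assumptions chosen just for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# One "outcome" metric and 100 candidate "driver" metrics,
# all pure random noise with no real relationship at all.
outcome = rng.normal(size=50)
drivers = rng.normal(size=(100, 50))

# Dredge: test every driver against the outcome and keep whatever
# clears the conventional 5% significance threshold.
false_hits = []
for i, driver in enumerate(drivers):
    r, p = stats.pearsonr(driver, outcome)
    if p < 0.05:
        false_hits.append((i, round(r, 2), round(p, 3)))

# With 100 tests at the 5% level, roughly 5 spurious "discoveries"
# are expected even though every series is random noise.
print(f"{len(false_hits)} 'significant' patterns found in pure noise:")
print(false_hits)
```

Running enough tests guarantees that some pattern eventually clears the bar, which is exactly why the analysis methods should be fixed before looking at the data.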

Data dredging is often committed out of a lack of awareness on the analyst’s part. The solution is to specify the methods to use when the problem is defined. For certain topics there is a prescribed set of best practices and protocols for processing data, which makes the choice of analysis method easier. 

While previous data analyses conducted to tackle the same problem or a related problem can help outline the analysis method to apply, one should ensure that they did not also suffer from data dredging.

Correlation is not causation

What do you think? Image source


One important type of pattern is correlation. Correlation is a quantity that describes the tendency of two quantities to vary together: quantities A and B are positively correlated when B tends to increase as A increases, and negatively correlated when B tends to decrease as A increases.

But does a correlation mean that the two quantities or events are linked, that is, that one event causes the other? Not necessarily. It is easy to find correlated quantities or events that have no real relation to each other. One example is the picture at the beginning of this section: does the marriage rate in Kentucky affect the number of people drowning after falling out of fishing boats? Nope, but the correlation exists. Another question: is Facebook driving the Greek debt crisis? What do you think?

How about this one? Image source


These are a few examples of spurious correlations: two events that are correlated, statistically speaking, but not related in terms of causation (one event being caused by the other). This is another side effect of our ability to find patterns.
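
A shared trend is often all it takes. Below is a minimal sketch using numpy, with two entirely made-up yearly series that merely both decline over the same decade; the numbers are invented for illustration:

```python
import numpy as np

# Two fictional yearly series that have nothing to do with each other,
# but both happen to decline steadily over the decade.
years = np.arange(2000, 2010)
noise = np.random.default_rng(0)
marriage_rate = 11.0 - 0.3 * (years - 2000) + noise.normal(0, 0.1, 10)
drownings = 100.0 - 2.5 * (years - 2000) + noise.normal(0, 1.0, 10)

# The shared downward trend alone produces a near-perfect correlation.
r = np.corrcoef(marriage_rate, drownings)[0, 1]
print(f"correlation: {r:.2f}")  # close to 1.0, yet no causal link exists
```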

While spurious correlations can be easy to spot by looking at the context of the two correlated quantities, they are harder to filter out when they seem to make sense. How can we check for the existence of causation? Fortunately, as Amplitude explains, causal relationships don’t happen by accident. In the context of business data analysis, there are two ways of testing for causation: hypothesis testing and A/B testing. We already covered hypothesis testing in one of our previous articles; you can check it out here.

Let us talk about A/B testing. A/B testing is a way of testing changes to your e-commerce site, be it the landing page, a product page, or even the checkout page. According to Hubspot, you create two versions of one piece of content that differ in a single variable, show the two versions to two similarly sized audiences, and analyze which one performed better over a specific period of time. A/B testing helps marketers observe how one version of a piece of marketing content performs alongside another.
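
As a sketch of how such a comparison is often evaluated, here is a standard two-proportion z-test in Python; the visitor and conversion counts are hypothetical, and the z-test is one common choice rather than a method prescribed by Hubspot:

```python
from math import sqrt

from scipy.stats import norm

# Hypothetical checkout-page experiment: visitors and conversions per variant.
visitors_a, conversions_a = 5000, 400   # version A: 8.0% conversion
visitors_b, conversions_b = 5000, 460   # version B: 9.2% conversion

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Pooled conversion rate under the null hypothesis of "no difference".
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))

# Two-sided z-test for the difference in proportions.
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("The difference between the variants is statistically significant.")
else:
    print("Not enough evidence that the variants differ.")
```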

These two methods can be used together to analyze the relationship between two variables better.

Including outliers or ignoring them

In the process of data preparation, you may notice a set of outlier values. Outliers are data points that stray from the existing patterns in the data. Outliers can affect the results of data analysis if not cleaned out.

Identifying outliers is not difficult. Statistics by Jim lists five ways of doing it; the two easiest are as follows:

  1. Sorting your datasheet - if an outlier has a significantly different value from the rest, a simple sort of the values will reveal it at the start or the end of the list. 
  2. Graphing your data - if sorting does not reveal obvious outliers, the next technique is to graph the entire set. Certain types of graphs and charts help highlight outliers in the data. They are as follows:

Scatter plots place each data point in an x-y plane and are used to visualize the correlation between two variables. Outliers manifest as a dot or two outside the overall pattern. 

Histograms plot the distribution of values of a variable as bars, where each bar counts the values that fall within a certain range. Outliers manifest as a small, isolated peak of bars at either end of the range.

Box-and-whisker plots visualize the range of values of a variable as a box spanning the middle half of the data, bounded by thin lines (“whiskers”) capped with a short line at each end. Outliers manifest as points or asterisks beyond the whiskers; a numeric version of the same fences is sketched below.
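
The whiskers of a box plot are conventionally drawn at 1.5 times the interquartile range (IQR) beyond the box, and the same rule can be applied directly in code. A minimal sketch with made-up order values:

```python
import numpy as np

# Hypothetical daily order values, including two likely entry errors.
orders = np.array([52, 48, 55, 60, 47, 51, 58, 49, 53, 50, 250, 3])

# The 1.5 x IQR rule: the same fences a box-and-whisker plot draws.
q1, q3 = np.percentile(orders, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = orders[(orders < low) | (orders > high)]
print(f"fences: [{low:.2f}, {high:.2f}]")
print(f"outliers: {outliers}")  # [250   3]
```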

One sophisticated method for identifying outliers is the z-score. The z-score quantifies how far a value is from the mean of its set. To understand this, you could imagine your dataset having a so-called normal distribution, with most values close to the mean.

The normal distribution plotted as a histogram. Image source

To describe the spread of the distribution of your data, you need to calculate the standard deviation. As datasets vary in how widely their values are distributed, the standard deviation is used as a “ruler” to measure how far a certain value is from the mean. The z-score is essentially a measurement of how far a value is from the mean in units of standard deviations. A positive z-score means the value is higher than the mean, while a negative z-score means it is lower. The higher the absolute value of the z-score, the farther the value is from the mean. Outliers have high absolute z-scores; a common rule of thumb flags values with an absolute z-score above about 3.
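
The computation itself is short. Here is a minimal sketch with made-up revenue figures; the threshold of 2 is an assumption, chosen looser than the usual 3 because the sample is small:

```python
import numpy as np

# Hypothetical daily revenue figures with one anomalous day.
revenue = np.array([1020, 980, 1050, 1010, 995, 1030, 970, 1005, 2500, 990])

# z-score: distance from the mean, measured in standard deviations.
z_scores = (revenue - revenue.mean()) / revenue.std()

# Flag values whose absolute z-score exceeds the chosen threshold.
threshold = 2
outliers = revenue[np.abs(z_scores) > threshold]
print(f"outlier values: {outliers}")  # [2500]
```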

However, the big question regarding outliers is whether to clean them or not. Why? Depending on the question at hand, you may actually need to zoom in on those outliers to understand the problem and solve it. At this point, you will need the power of diagnostic analysis.

Summary

To summarize, we discussed six common pitfalls to watch for:

  1. Confirmation bias - looking for data that affirms one’s hypothesis while disregarding conflicting evidence
  2. Survivorship bias - drawing conclusions only from the data that survived, while ignoring the missing data that might contradict them
  3. Focusing too much on data and metrics - substituting holistic decision-making with fixation on data and metrics
  4. Dredging the data for patterns - applying analysis after analysis until patterns are “uncovered” that do not exist in reality
  5. Correlation is not causation - treating a correlation as proof of causation without testing the causal relationship
  6. Including outliers or ignoring them - making a wrong call on whether to consider outliers in the analysis or to ignore them

These are just some of the pitfalls in data analysis to watch for. If they sound like statistics to you, it’s because they are often discussed in statistics. 

I hope you learned a lot from our five-article series on data analysis. We covered a lot of ground, and by now you can make good calls when analyzing data. To further enhance your data analysis skills, check out our upcoming app Lido. Because it automates data gathering and analysis, it helps you steer clear of the pitfalls listed here, so you can go straight to the metrics and make the right decisions for your business. Get started for free.

References

Confirmation Bias And the Power of Disconfirming Evidence

Examples and Observations of a Confirmation Bias

Confirmation Bias - Definition & Examples

Confirmation Bias: How It Affects Your Organization | HBS Online

5 Types of Bias in Data & Analytics

Business analytics is ridden with confirmation bias | by Keith McNulty

Data Analytics and the Confirmation Bias

Survivorship Bias: The Tale of Forgotten Failures

How Survivorship Bias Affects your Analysis

Missing data can be the best data | by Paul May

Missing data? Survive Survivorship Bias with Qlik

Survivorship bias - Biases & Heuristics

How ‘survivorship bias’ can cause you to make mistakes

What Every Founder Needs to Know About Survivorship Bias

Focusing Too Much on Data Is Bad For Performance

Is Too Much Focus a Problem? - HBS Working Knowledge

Are CEO's Missing out on Big Data's Big Picture?

Do You Have Metric Fixation?

Fixing Metric Fixation: A Review of The Tyranny of Metrics

Data-dredging bias - Catalog of Bias

What Is Data Dredging? | Blog

Data mining or data dredging?

Data Dredging - Caution When Snooping After Data Patterns

Patterns: The Need for Order

Humans Are the World's Best Pattern-Recognition Machines, But for How Long?

Interpreting Correlation Coefficients 

Why correlation does not imply causation? | by Seema Singh

Correlation is not causation | Mathematics

Correlation vs Causation | Introduction to Statistics

Correlation vs Causation: Understand the Difference for Your Product

Correlation vs Causation: What's the Difference? | Astute

Spurious Correlations

Spurious correlations: Margarine linked to divorce?

Correlation or Causation?

Clearing up confusion between correlation and causation

How to Do A/B Testing: A Checklist You'll Want to Bookmark

5 Ways to Find Outliers in Your Data

What are outliers and how to treat them in Data Analytics? - Aquarela

What are data outliers and how can they eliminate business latency?

Outlier Analysis: Definition, Techniques, How-To, and More

What is Outlier Analysis and How Can It Improve Analysis?

Normal Distribution

Z-Score Definition
