If I showed you a chart like this, and I said the blue line is the marketing budget in dollars and the red line is the amount of product purchased, how many of you would think that increasing the amount of money dedicated to marketing leads to increases in customer purchasing?
Here is what the graph actually shows. The blue line shows the number of movies Nicolas Cage was in per year, and the red line shows the number of people who died that year by falling into a pool. I am not kidding about that. I got this graph from www.tylervigen.com, a very funny website that I highly recommend you check out to see other examples of bizarre things that seem to correlate. This graph is made from real data, though, from the Centers for Disease Control and Prevention and the Internet Movie Database. I promise you it's not made up. You can go look it up and plot the numbers yourself. As you can see, it looks to our eye like there is a decent relationship between these two things over the years.
Now, knowing that, how many of you think that the number of movies Nicolas Cage is in per year causes increases in pool drownings? Or that the number of pool drownings per year increases the number of movies Nicolas Cage agrees to star in? I'm gonna hypothesize that some of you were willing to guess that the amount of money invested in marketing does indeed cause customers to purchase more. However, despite the fact that I showed you the exact same data in both graphs I asked you to evaluate, I hypothesize that very few of you were willing to say that Nicolas Cage movies cause pool drownings, or that pool drownings cause Nicolas Cage to star in more movies. And for the record, you were correct if you said that Nicolas Cage movies do not cause pool drownings and that pool drownings do not cause Nicolas Cage to star in more movies.
When the match between two sets of data is compelling and close, like it is in these examples, it's psychologically very easy to tell yourself that one thing you are measuring is causing the other. It's also very easy for an audience to tell themselves the same thing. It can actually be quite difficult to prevent yourself from telling that causal data story. However, as the Nicolas Cage and pool drownings example shows us because it is so bizarre, and as the rest of these real-life examples show now that I've revealed their true titles, just because two variables correlate or seem to be linked to one another does not mean that one causes the other. Correlation does not equal causation.
Before we continue, it might help to define some terms. Correlation refers to the phenomenon of two things having a tendency to vary together over multiple time points or multiple measurements. The number of Nicolas Cage movies and the number of pool drownings were correlated in our example. Causation refers to the phenomenon of one thing happening as the result of another thing. If I punched you and you got a black eye, you've done enough scientific experiments to know that my punch caused your black eye. If I punched you again, you'd get a black eye again, and I'd probably get a black eye too, since you'd probably punch me back. A coincidence is when two things happen at the same time purely by chance. A spurious relationship is when two things correlate either by chance or because of an unmeasured variable, but not through direct causality.
Let me explain what I mean by an unmeasured variable. The coincidence or spurious relationship we observed on our graph earlier might be due to a variable we didn't pay any attention to: summer activities. Nicolas Cage tends to be in blockbuster movies that are released during the summer, especially summers with big events. In addition, people go into the pool primarily during the summer, and more pool parties might occur during summers with big events.
So the reason Nicolas Cage movies and pool drownings might be so correlated is that they're both related to the unmeasured variable, summertime. I don't know if that's the case, but it's a good hypothesis that might explain the spurious relationship we see.
Now that we know what these terms are, let's return to why the issue of correlation not equaling causation is dangerous for business analytics. If the reason you're analyzing data is to determine what changes you should make in your business processes, you are looking for changes that will cause improvements in those processes. If the changes you recommend are based on graphs that only show correlation but do not show causation, your recommendations might not work or, even worse, they could cause decreases in business performance.
Let's look at a business example that makes this very clear. Imagine you run an analysis that finds that the number of Internet security breaches companies have correlates with the number of engineers those companies have in their Internet security departments. This surprises you and goes against your intuition, so like a good analyst you find another data set with another set of companies and see if you can replicate the effect. You do. Based on this data, you might then infer that having a greater number of engineers in your Internet security department causes more security breaches. As a consequence, you then recommend that your company reduce its number of engineers. If you did this, however, your recommendation would, of course, have catastrophic consequences. Your company would probably end up with an even greater number of security breaches. The reason is that having a greater number of security breaches caused executives in your company to seek help from engineers, not the reverse. The increased number of engineers might not have solved the security breach problem, but if you fire all of your engineers, you'll have no chance of solving it.
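As an aside, the unmeasured variable idea behind the summertime hypothesis can be illustrated with a small simulation. This is only a sketch on made-up numbers, not the CDC or movie database data. The names `summer`, `movies`, and `drownings`, the noise levels, and the number of simulated years are all assumptions chosen for illustration. Both outcomes are generated from the same hidden variable and never from each other, yet they still correlate strongly until the hidden variable is accounted for.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_years = 2000          # assumption: many synthetic "years" to reduce noise

# Hypothetical hidden variable: how eventful each summer is (arbitrary scale).
summer = [rng.gauss(0, 1) for _ in range(n_years)]

# Each outcome depends on the hidden variable plus its own independent noise.
# Neither outcome is ever computed from the other.
movies = [s + rng.gauss(0, 0.5) for s in summer]
drownings = [s + rng.gauss(0, 0.5) for s in summer]

raw_r = pearson(movies, drownings)
adjusted_r = pearson(residuals(movies, summer), residuals(drownings, summer))

print(f"correlation ignoring the hidden variable: {raw_r:.2f}")
print(f"correlation after adjusting for it:       {adjusted_r:.2f}")
```

With these settings the first number should come out strongly positive and the second close to zero. In a real analysis you usually cannot do this adjustment, because the hidden variable is exactly the one you did not measure.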
Now, since most of us know that it will be impossible to fight Internet security breaches if you don't do something to protect your Internet system, this would have been a logical mistake that was very easy to identify. But what if I replaced the labels on this chart with labels depicting the number of larger ads on a website and click-through rate? Click-through rate represents the percentage of users who click on an advertisement link when they see it. So this graph shows us that the larger the ad, the higher the click-through rate.
I want you to imagine that you are a data analyst working on a marketing project. Really close your eyes and create a vision of what that would look and feel like. You are new to the job, and you really wanna impress your boss. You've spent the past few days getting and cleaning your data. You finally get the data into Tableau, and in a couple of seconds, you see this graph. I want you to really think about this. If you were in that situation and you saw this graph, would you be tempted to assume that larger ads cause higher click-through rates? And would you be tempted to come to the conclusion that the company should invest its money in larger ads? If so, you are now getting a flavor of why misinterpreting correlation is so common in business, but also so dangerous.
I know these conclusions sound tempting, but the data on this graph do not support a causal relationship any more than they did when we were looking at the number of security engineers and security breaches. Recommending that the business invest in larger ads without any further information would be a risky financial recommendation to make. What kind of information would make that recommendation less risky? Watch the next video to find out.