I've Got My Data, Now Where Do I Get My Science?

Mark Voortman

Updated on September 9, 2020

4 minute read

The short answer: At Othot.

The long answer: Well, you'd better buckle up because the science part turns out to be a tricky business. Let's look at a few well-studied examples. It is a known fact that when ice cream consumption increases, the number of drownings increases as well. Maybe there is some truth to the warning that many of us have received from our parents: you shouldn't swim after eating (ice cream, in this case). The good news is that you can keep stuffing your face with ice cream because there is no convincing evidence (as far as I know) that it will increase the likelihood of drowning. So what's going on here? As you may have guessed by now, the explanation is that both ice cream consumption and drownings are direct effects of another variable, namely hot summer days. (As a data scientist you always have to be cautious. Ice cream consumption could potentially cause drownings directly, but one would have to propose a mechanism and take measurements to validate it.) This example is usually presented in the context of the maxim that correlation does not imply causation.¹
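
To see how a common cause can manufacture a correlation, here is a minimal simulation sketch. Every number in it (sales levels, drowning counts, noise) is invented for illustration, and `pearson` is a small helper written for this example rather than anything from a particular library. In this toy world ice cream has no effect on drownings at all; hot days simply push both up.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up toy data: each day is (hot, ice_cream_sold, drownings).
# Hot weather raises both numbers; neither number influences the other.
days = []
for _ in range(2000):
    hot = rng.random() < 0.5
    ice_cream = 100 + 80 * hot + rng.gauss(0, 10)
    drownings = 5 + 3 * hot + rng.gauss(0, 1)
    days.append((hot, ice_cream, drownings))

def corr_for(rows):
    """Correlation between ice cream and drownings for the given days."""
    return pearson([r[1] for r in rows], [r[2] for r in rows])

overall = corr_for(days)
hot_only = corr_for([d for d in days if d[0]])
cold_only = corr_for([d for d in days if not d[0]])

print(f"all days:  {overall:.2f}")
print(f"hot days:  {hot_only:.2f}")
print(f"cold days: {cold_only:.2f}")
```

Across all days the two series are strongly correlated, but once you look only at hot days, or only at cold days, the correlation should be close to zero.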

Correlation is usually taken as a hint at some sort of causal relationship, but pinning down the exact relationship requires additional study (for example, a randomized experiment). And sometimes there is no causal relationship at all. A classic example is the water level in Venice and bread prices in England: while these are clearly not causally related, they are correlated because both are increasing over time (this correlation would disappear if you were to compare the rates of change of the water level and bread prices over a period of time).
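
Here is a minimal sketch of that last remark, using two invented series that both drift upward. The names `water_level` and `bread_price` are just labels for made-up numbers, not real Venice or England data, and `pearson` and `changes` are helpers written for this example.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def changes(series):
    """Period-over-period change: each value minus the previous one."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
periods = range(200)

# Two made-up series with independent noise and nothing in common
# except that both trend upward over time.
water_level = [0.5 * t + rng.gauss(0, 5) for t in periods]
bread_price = [0.3 * t + rng.gauss(0, 3) for t in periods]

corr_levels = pearson(water_level, bread_price)
corr_changes = pearson(changes(water_level), changes(bread_price))

print(f"correlation of levels:  {corr_levels:.2f}")
print(f"correlation of changes: {corr_changes:.2f}")
```

The levels should come out highly correlated purely because of the shared upward trend, while the period-to-period changes should be close to uncorrelated.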

The next phenomenon is so famous that it has its own name: Simpson's paradox.² In the seventies Berkeley was sued for bias against women because the percentage of women who were admitted to graduate school (out of the number of women who applied) was significantly smaller than the corresponding percentage for men; it seemed unlikely this could be due to chance. For men the number was 44% while the number for women was only 35%. This is a clear-cut case, right? Wrong. It turns out that when these numbers were broken down by department, most departments admitted a higher percentage of women than men. How can these two things be true at the same time? Most women applied to competitive departments where overall admission rates are low, while men tended to apply to less competitive departments, so more of them got in overall. There are many other real-life examples of Simpson's paradox, often with important implications.
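
If the reversal still feels like a magic trick, a few lines of arithmetic may help. The sketch below uses two hypothetical departments with invented application counts (these are NOT the Berkeley figures) chosen so that women have the higher admission rate in each department and yet the lower rate overall.

```python
# Hypothetical numbers for illustration only, stored as (applied, admitted).
applications = {
    "easy department": {"men": (800, 500), "women": (100, 70)},
    "hard department": {"men": (200, 40), "women": (900, 225)},
}

def department_rate(department, group):
    """Admission rate for one group within one department."""
    applied, admitted = applications[department][group]
    return admitted / applied

def overall_rate(group):
    """Admission rate for one group with all departments pooled together."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return admitted / applied

for department in applications:
    men = department_rate(department, "men")
    women = department_rate(department, "women")
    print(f"{department}: men {men:.1%}, women {women:.1%}")

print(f"overall: men {overall_rate('men'):.1%}, women {overall_rate('women'):.1%}")
```

Women do better in both departments, but because most of them applied to the hard department, their pooled rate ends up well below the men's.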

The hot summer day and department variables above are called confounding variables³ (a.k.a. confounders, confounding factors, lurking variables, hidden variables, latent variables, conflating variables, ...) because, well, they confound (causal) analysis and are unknown (at least initially). In the previous paragraph, when the data is broken down by department the trend seems to reverse. This makes a lot of people feel uncomfortable, and they may think that it won't affect them. However, here is an example that is (possibly) a bit closer to home. Suppose you are trying to determine the causal relationship between marketing spend and product sales. There may or may not be a relationship, and if there is one, it has to be quantified. What could the confounders be in this scenario? Perhaps demand for your products is seasonal, which would mean you have to include the month of the year (for example) in the analysis. But sales could also be driven by the weather (are you selling ice cream?!), so you have to consider weather as a confounder as well. In fact, there could be many other potential confounders. One may think that breaking down the data into more and more subgroups will always result in a more correct analysis, just like breaking down admission numbers by department did the trick before.

Here's the nub: adding in all (potential) confounders can lead to an incorrect analysis! This is a subtle but important point. First of all, when the number of confounders increases, the amount of data for each subgroup decreases. For example, there are 12 months in a year, but if you also consider hot, moderate, and cold weather, this gives a total of 12 × 3 = 36 different subgroups that all have to be investigated separately, each with its own data. Incidentally, in this case there seems to be a clear relation between the month and hot versus cold weather, but it may not be immediately obvious how this relationship can be utilized. More importantly, grouping data by the wrong variable (one that is a common effect of two variables rather than a common cause) can induce a previously non-existent effect between those two variables.

Here is a simple illustration of that last point. A flat tire can result from wear and tear on your tires, but also from nails on the road. Note that there is no relationship between wear and tear of a tire and nails on the road; they are independent of each other. However, once you know you have a flat tire, they become dependent (which may mistakenly be interpreted as causality!) because as one explanation becomes more likely it “explains away” the other. For instance, if you know that you have been neglecting your car, nails on the road become a less likely explanation for a flat. Alternatively, when you actually see nails on the road, the issue was probably not wear and tear of the tires. This phenomenon is known as explaining away, which is very similar to Berkson's paradox,⁴ and understanding it is crucial in many cases.
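
The flat tire story can be checked by brute force. The sketch below is a toy model under loud assumptions: worn tires and nails each occur in 10% of cases (invented figures), they are independent, and a flat happens exactly when at least one of them is present. The helper `chance_of_nails` is written for this example; it enumerates the four possible situations and reports how often nails are present among those matching a condition.

```python
from itertools import product

# Invented figures for illustration: each cause shows up 10% of the time.
CHANCE_WORN = 0.1
CHANCE_NAILS = 0.1

def chance_of_nails(condition):
    """Share of cases with nails among the cases where condition(worn, flat) holds."""
    matching = 0.0
    matching_with_nails = 0.0
    for worn, nails in product([True, False], repeat=2):
        # Independence assumption: the weight of a case is the product of its parts.
        weight = (CHANCE_WORN if worn else 1 - CHANCE_WORN) * (
            CHANCE_NAILS if nails else 1 - CHANCE_NAILS
        )
        flat = worn or nails  # toy rule: either cause alone gives a flat
        if condition(worn, flat):
            matching += weight
            if nails:
                matching_with_nails += weight
    return matching_with_nails / matching

# Before looking at the tire: knowing about wear tells you nothing about nails.
nails_if_worn = chance_of_nails(lambda worn, flat: worn)
nails_if_not_worn = chance_of_nails(lambda worn, flat: not worn)

# After seeing a flat: wear and nails suddenly carry information about each other.
nails_if_flat = chance_of_nails(lambda worn, flat: flat)
nails_if_flat_and_worn = chance_of_nails(lambda worn, flat: flat and worn)
nails_if_flat_and_not_worn = chance_of_nails(lambda worn, flat: flat and not worn)

print(f"nails, tires worn:               {nails_if_worn:.2f}")
print(f"nails, tires fine:               {nails_if_not_worn:.2f}")
print(f"nails, flat tire:                {nails_if_flat:.2f}")
print(f"nails, flat tire and tires worn: {nails_if_flat_and_worn:.2f}")
print(f"nails, flat tire and tires fine: {nails_if_flat_and_not_worn:.2f}")
```

Without a flat, the chance of nails is the same whether or not the tires are worn. Given a flat, learning that the tires were worn drops nails back to their baseline, while learning that the tires were fine makes nails the only remaining explanation.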

How are you to handle all these confusing and sometimes counterintuitive results? The good news is that there has been a lot of progress in recent decades on developing formalisms to model all of the above scenarios correctly. This did not come easily; while Simpson's paradox was known (but not named) at the end of the 19th century, it took almost a century to satisfactorily resolve the paradox. The magical phrase in this context is Bayesian networks.⁵ To be continued...

PS: I'd like to point out that I used zero equations and did not use the phrase probability distribution either in this article!

Sources:
¹ https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation
² https://en.wikipedia.org/wiki/Simpson%27s_paradox
³ https://en.wikipedia.org/wiki/Confounding
⁴ https://en.wikipedia.org/wiki/Berkson%27s_paradox
⁵ https://en.wikipedia.org/wiki/Bayesian_network

Mark Voortman

Mark is a Data Scientist Architect at Othot.
