Last spring, while I was drafting article offshoots from a year-old, book-length breastfeeding manuscript, struggling to prioritize depression, breast cancer, or HIV — I had a flash of insight: Simpson’s paradox seemed to be a throughline in all these different stories. Maybe I could tell one story with a simple statistical-methodology spine, and so turn three or more offshoot articles into one that I could actually finish — last summer. A clever shortcut. (Spoiler alert: don’t be clever.)
Then I learned the truth about Simpson’s: it’s not Simpson’s, there’s no paradox, the accidental umbrella term can confuse different concepts, and it sounds sciencey but hides the breakdown of the fact-value dichotomy when you get into actually doing “science before statistics.” That’s why, when you hear “Simpson’s paradox,” you should hear the opening lyrics of The Simpsons theme song, followed swiftly by Homer’s “D’OH!” Then you should feel the sharp pang of the Reaper at your back with the scythe. Because there are no shortcuts. Worse, life is short and science is a human enterprise. We’re doomed to die before getting our work done, but not before making tons of mistakes in the process. That’s why hearing “Simpson’s paradox” should strike existential terror into your heart, like a memento mori of modern science as a culture as fallible and prone to empty-signifier spell-casting as any other.
Credit (and no blame for any of my mistakes) is due to Sander Greenland, professor emeritus in the UCLA departments of epidemiology and statistics, and a leading statistics reform voice working, for instance, to reduce statistical significance testing misuse and improve education so more scientists understand the need to fully interpret the 95% confidence or compatibility interval in terms of what is clinically important. When I mentioned Simpson’s, he kindly sent me a bounty of articles that generated this post. It summarizes what I learned falling down this rabbit hole, so you don’t have to. Longer-form notes available on request.
What is Simpson’s paradox?
The Myth
The current statistics myth is that Simpson’s paradox explains apparent effect disappearance or reversal when an omitted variable is added to a model. If you follow research methods, you know that — like in most fields — there’s an upper-echelon discussion with identifiable thought leaders, and then there’s what everybody really does in practice, and there’s a big space between these two conversations. So p-values, statistical significance, confidence (aka compatibility) intervals, matching techniques, and a lot of the other basics are defined and used differently on these different levels. So it is with Simpson’s, but in a way that hasn’t quite been standardized yet even at the upper level.
In Cambridge math professor Sir David Spiegelhalter’s 2019 book The Art of Statistics: Learning from Data, Simpson’s paradox explains why an effect can disappear or reverse when data from different groups are combined without accounting for a confound or confounder — a third variable that influences both independent (or intervention or exposure) and dependent (or outcome) variables. Confounds are confounding; they’re confusing, and they can create spurious results. Richard McElreath, director of the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany, and another leading science reformer, defines confounds as features of the sample and how we use it that mislead us.
A classic example comes from a study of gender bias in 1973 graduate admissions at the University of California, Berkeley. The study found men were 1.8 times more likely than women to gain admission. But, on closer examination, men tended to apply to less competitive programs like engineering, while women tended to apply to more competitive programs like English. Leaving aside the incorrect price signal in an English program being harder to get into than an engineering one, when researchers analyzed admissions within departments, women tended to have a better shot at admission than men. So the department they were applying to was a confound driving the apparent association between gender and admission.
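To make the structure concrete, here’s a toy sketch in Python with made-up numbers (not the actual Berkeley data): within each hypothetical department women are admitted at a higher rate, yet in the aggregate they look worse off, because they mostly applied to the harder department.

```python
# A toy sketch with made-up numbers (NOT the real Berkeley data), just to show the shape
# of the reversal: women do better within each department but worse in the aggregate,
# because they mostly applied to the harder department.
applications = {
    # department: {group: (admitted, applied)}
    "easy": {"men": (500, 800), "women": (130, 200)},
    "hard": {"men": (40, 200), "women": (200, 800)},
}

totals = {"men": [0, 0], "women": [0, 0]}
for dept, groups in applications.items():
    for group, (admitted, applied) in groups.items():
        totals[group][0] += admitted
        totals[group][1] += applied
        print(f"{dept:>7} / {group:>5}: {admitted / applied:.0%} admitted")

for group, (admitted, applied) in totals.items():
    print(f"overall / {group:>5}: {admitted / applied:.0%} admitted")
```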
There are many other well-known examples of Simpson’s paradox dealing with demographic categories we tend to be concerned about in terms of social implications. Another classic comes from Michael Radelet’s 1981 study of race and the death penalty in Florida, which found that victim race was a crucial confound in sentencing disparities (slides 9-11; original). Another is racial disparities in policing. One health example deals with exercise lowering cholesterol in every age group, but looking harmful in the aggregate; age is a confound (slide 29). Another current health example comes from COVID-19 data that appear to show Italy having a much higher case fatality rate than China in the first wave, until you account for age.
In my research, I noticed it’s often said that “exclusive breastfeeding” (giving infants no other food or water) protects against postpartum depression and breast cancer in mothers, and HIV transmission from mothers to infants. But, in fact, maternal health could be driving these apparent effects, which may disappear or reverse when you account for breastfeeding problems, which are often caused by maternal health problems. No surprise that women with more health problems may be less capable of performing the most metabolically intensive task human bodies can undertake (lactation), or that ill health can cause other health problems. But good luck finding data in which researchers took mothers’ well-being and experiences seriously enough that you can actually analyze maternal health and breastfeeding problems as confounds in breastfeeding effect models.
The History
As Greenland pointed out, Simpson wasn’t the first to talk about confounding in statistics. That credit goes to Pearson and Yule.
Simpson made an important contribution, as well as a mistake, and his “paradox” fits into a recent revolution in causal inferences that you need to know to do better science (including better statistics).
Simpson’s Success: Story Drives Statistics
Miguel Hernán is a professor of biostatistics and epidemiology at Harvard who’s been a leading science reform voice working, for instance, to reduce euphemisms implying causation without thinking through causal logic and talking openly about it. In a 2011 International Journal of Epidemiology paper entitled “The Simpson’s Paradox Unraveled,” Hernán and colleagues explain that Simpson’s point was that “From a purely statistical standpoint, no general rule seems to exist as to whether the conditional association or the marginal association should be preferred” (p. 781). In other words, you need to think about cause and effect in order to get statistical analysis right. You have to think about causality to see a confound at work in an analysis, and you have to believe a particular causal story to then choose that model (corrected for confounding) over another one. The story doesn’t come from apparently objective statistics. Story drives statistics.
In this way, Simpson’s success lay in prefiguring contemporary statistics reform efforts by emphasizing the need for qualitative knowledge — and not just expert knowledge in the traditional sense, but also, we might say, common sense — to play a central role in statistical analysis. He argued that “identical data arising from different causal structures need to be analysed differently” (Hernán et al, p. 784). This is the part that Simpson got right. It’s an important point that we should celebrate Simpson for highlighting.
At the same time, Hernán et al show there’s no paradox here. Simpson was worried about potential confusion from the absence of a general statistical rule for analysts to prefer conditional to marginal association or vice-versa. But we don’t need a rule of thumb for statistical analyses if we make causal inference the explicit goal from the start, and incorporate causal logic into the analysis first, not last. Scientific thinking isn’t about implementing rules of thumb. Story drives statistics, and we have to think critically about our own perspectives (and others’) in constructing that story.
Simpson’s Sin: Confounded Confounding
No one is perfect, hindsight is 20/20, and it turns out Simpson chose a poor example. His was inadvertently closer to noncollapsibility than to confounding. Noncollapsibility is a numeric phenomenon, while confounding is a causal one.
Greenland explains this in a two-part series (1, 2) of 2021 articles in the Journal of Clinical Epidemiology. Noncollapsibility is a numeric averaging failure that can occur when the odds stop approximating the risks: when the odds are high in any subgroup, they diverge from the proportions, and simply averaging the odds across subgroups then creates a misleading impression of the overall picture. The same problem can affect other frequency measures and comparisons, but it’s worst in odds ratios.
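Here’s a minimal numeric sketch of pure noncollapsibility, with illustrative numbers of my own (not data from any study): treatment is unrelated to the stratum, so there is no confounding, and the odds ratio is exactly 4 in each of two equal-sized strata; yet the odds ratio computed from the collapsed (averaged) risks comes out smaller.

```python
# A minimal sketch of odds-ratio noncollapsibility with illustrative numbers (not data).
# There is no confounding here: the two strata are equal-sized and treatment is unrelated
# to stratum. The odds ratio is exactly 4 in each stratum, yet the marginal odds ratio,
# computed from the averaged risks, is smaller -- the odds ratio fails to "collapse."

def odds(p):
    return p / (1 - p)

# Assumed risks of the outcome under treatment / no treatment in two equal-sized strata.
risks = {
    "stratum 1": {"treated": 0.8, "untreated": 0.5},
    "stratum 2": {"treated": 0.5, "untreated": 0.2},
}

for name, r in risks.items():
    print(f"{name}: odds ratio = {odds(r['treated']) / odds(r['untreated']):.2f}")  # 4.00 in both

# Collapse: average the risks across the equal-sized strata, then take the odds ratio.
p_treated = sum(r["treated"] for r in risks.values()) / len(risks)
p_untreated = sum(r["untreated"] for r in risks.values()) / len(risks)
print(f"collapsed: odds ratio = {odds(p_treated) / odds(p_untreated):.2f}")  # ~3.45

# By contrast, the risk difference collapses cleanly: 0.3 in each stratum and 0.3 overall.
```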
Noncollapsibility without confounding can happen for any measure of association when the covariate is affected by the treatment. Noncollapsibility can also occur with confounding — they can both occur with or without one another. And they can even become equivalent under certain conditions — when the effect measure can be expressed as the average effect on population members provided the covariates in question form a sufficient set for control.
You need to be on the lookout for noncollapsibility especially in matching methods like coarsened exact matching (“cem” to friends and poor sods like me who used it in their dissertation research). Because if you subdivide data again and again into more finely covariate-matched groups that further predict the outcome, then you will see subgroups with ever higher odds that do not match the proportions — and thus ever more noncollapsibility. This is a form of sparse-data bias, but keep in mind that overall sample size is not the issue — again, it happens especially when the odds in any subgroup are high, so they diverge from the proportions.
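To see why high-risk subgroups are the trouble spot, here’s a tiny illustrative sketch (made-up values, not data) of how quickly odds pull away from the proportions they’re meant to approximate:

```python
# A tiny sketch (illustrative values, not data): odds approximate risks well when risks are
# low, then pull away sharply as risks climb -- which is why finely stratified, high-risk
# subgroups drive odds-ratio noncollapsibility and sparse-data problems.
for risk in (0.05, 0.2, 0.5, 0.8, 0.95):
    odds = risk / (1 - risk)
    print(f"risk {risk:.2f} -> odds {odds:.2f} ({odds / risk:.1f}x the risk)")
```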
So now we have three problems in talking about Simpson’s paradox: (1) it’s not Simpson’s, in the sense that he wasn’t the first to describe confounding, (2) there’s no paradox, in the sense that you should be thinking about causality early and often when doing statistics, and (3) Simpson’s example was inadvertently showing something closer to the numeric phenomenon of noncollapsibility (a failure of subgroup measures, especially odds in subgroups with high risks, to average simply into the aggregate), and not the causal phenomenon of confounding that he thought he was talking about. But wait, there’s more! People don’t just do good work (Simpson’s success) and make mistakes (Simpson’s sin) — they also come from a particular time and place that influences the way they see things…
Simpson’s Setting: Pre-Causal Inferences Revolution
On one hand, it’s not fair to hold Simpson responsible for knowing, when he was writing his now-famous paper published in 1951, what we know now. On the other hand, “Simpson’s paradox” has arguably become a term of a bygone era, when it was normal for scientists to speak of “omitted variable bias,” and the paradox was the reversal of the true effect sign when the omitted variable was added to complete the model. You don’t hear that language so much among upper-echelon methodologists anymore, because it’s widely recognized that all models omit variables (as in George Box’s aphorism “All models are wrong, but some are useful”). And instead of Simpson’s paradox, scientists who are up on the relatively recent revolution in causal inferences tend to talk more specifically about what’s driving the apparent reversal when it occurs.
It turns out to be really important to do that, because there isn’t just one type of confound — the simplest is the Fork, in which Z is a common cause of both X and Y. According to McElreath, there are four elemental confounds, including also the Pipe, in which X → Z → Y; the Descendant, in which a variable A descends from Z (say, X → Z → Y with Z → A), so conditioning on A partly conditions on Z; and the Collider, in which X → Z ← Y: here X and Y each cause Z but share no common cause, and conditioning on Z induces an association between them.
The Fork can confound your analysis when you don’t think through causality properly first and account for a key covariate. But the Collider can confound your analysis when you think you have thought through causality properly and accounted for a key covariate — but in so doing, you accidentally introduce more bias (collider stratification bias) than you correct for, because it throws stones in the causal chain river. Here’s how it works.
A well-known recent example of collider bias comes up in Spiegelhalter and Anthony Masters’s 2021 book Covid by Numbers: Making Sense of the Pandemic with Data: some studies purported to show that smoking protected against bad COVID-19 outcomes. (I’m not linking to these signs of the apocalyptic scientific times; go sully your own PubMed search history.) For those of you who are too young to know Woody Allen films, there’s a famous clip from his 1973 movie “Sleeper” about a man who wakes up in the future and finds that everything we thought was unhealthy, like steaks, hot fudge, and smoking, turned out to be good for you. This is a fantasy; we know that smoking kills people, although eminent statistician R.A. Fisher took tobacco company money and questioned the link (as related in Stephen Senn’s Dicing with Death: Living by Data).
But the fact that smoking is really bad for you also means that, if your health is poor, you may be more likely to quit smoking, since you can’t afford the hit — so current smoking indicates better current health in many cases. Thus, by controlling for factors that are influenced by smoking, we distort any causal relationship between smoking and COVID-19 risk, because we then control for a collider.
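Here’s a minimal simulation sketch of the general collider mechanism, with hypothetical variables and made-up data: two independent causes of a common effect become (spuriously) associated once you condition on that common effect.

```python
# A minimal simulation sketch of collider bias (hypothetical variables, made-up data).
# X and Y are independent, but both cause Z. Unadjusted, X and Y are (correctly)
# unassociated; restrict attention to a slice of the collider Z -- i.e., "control" for
# it -- and a spurious negative X-Y association appears.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)            # one cause
y = rng.normal(size=n)            # another, independent cause
z = x + y + rng.normal(size=n)    # the collider: caused by both

print("corr(X, Y), unadjusted:       ", round(float(np.corrcoef(x, y)[0, 1]), 3))  # ~0

within = np.abs(z) < 0.5          # condition on (a stratum of) the collider
print("corr(X, Y) within a Z stratum:", round(float(np.corrcoef(x[within], y[within])[0, 1]), 3))  # clearly negative
```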
A parallel example is in alcohol studies, which for a long time showed a J-shaped cardiovascular risk curve such that epidemiologists thought moderate drinking might be protective. They wrote about the “happy hour hypothesis.” But it turned out that alcohol is poison, any amount of poison can cause harm, and what we were probably seeing was sicker people quitting drinking to prevent that harm, plus socio-economic status affecting health and drinking alike. How do you solve this sort of problem?
Spiegelhalter suggests simpler demographic models avoid collider bias in the COVID-19-smoking case. Just control for age, gender, race, and socio-economic status, and you do see a link emerge — unsurprisingly — between current smoking and death from COVID-19. This perspective fits in with a wider Zeitgeist among leading methodologists who tend to emphasize that we need to keep in mind at all times that we are mere mortals affected by all sorts of human stupidities, and our estimates contain a lot of uncertainties as well as probable biases. Keeping it simple seems wise. Being humble about what we know seems necessary. One way to keep it simple and be humble is to do relatively simple modeling, and then to present it, accordingly, as non-comprehensive.
But this type of solution still leaves the door open for collider bias when researchers treat basic demographics as confounds — when they’re associated with the exposures and outcomes of interest. For example, a number of studies (from the U.S., Taiwan, Denmark, Finland earlier, Finland later…) have found a link between abortion and substantially increased suicide risk. The effects are really huge (several times increased risk), but most abortion providers don’t tell patients about this possible risk. Why? It’s threatening (financially, politically, psychologically) for abortion providers to recognize that their work may be substantially harmful to some women’s mental health. Women themselves might not trust anyone giving them this information in this highly politicized issue environment. And the research literature reassures them that these huge correlations in no way suggest causation — even though the evidence does not demonstrate this.
Some researchers have corrected for confounds including low socioeconomic status in models purporting to show that the association is not causal. Mika Gissler, first author of the standard Finnish abortion study citations and pan-Scandinavian research professor, wrote:
The original finding was published in BMJ 1996 - women with recent induced abortions have three-fold increased risk to suicide compared to non-pregnant women and six-fold risk compared to women giving birth. We have replicated the study also with other causes of death, including homicide and accidents, showing the similar pattern. This suggest[s] that the increased suicide risk shows no causality but common risk factors for induced abortions and suicide. These include substance abuse, low socioeconomic status, short education, and previous psychiatric disorders
(email correspondence, July 12, 2022).
Bracketing a number of other issues (e.g., possible relationships other than independence between risks of death from homicide, accidents, and suicide), what if disadvantaged background is associated with unintended pregnancy, abortion, and mental health problems? This would mean that treating SES as a confound could introduce collider bias that is just as severe as (or more severe than) confounding. It does not occur in a political vacuum when researchers dismiss substantial possible correlations as showing no causality, when the evidence is insufficient to prove that claim. And demographics are often among the most political variables we put in statistical models. There’s no free lunch from thinking about causality and power.
Back to the big picture, there are a few other perspectives on how to approach the problems of confounding and collider bias. One is to run a bunch of different models, or even all reasonable analyses, in what’s called a multiverse analysis. There’s a famous paper by Sara Steegen and colleagues including Andrew Gelman about increasing transparency through this type of analysis. Philosopher of science and London School of Economics emeritus professor Nancy Cartwright criticizes this approach as having “no causes in; no causes out.”
Another perspective about how to approach these problems comes from the relatively recent revolution in causal inferences spearheaded by Judea Pearl, UCLA emeritus professor and Cognitive Systems Laboratory director, and colleagues, and it lets you draw causal logics in a way that makes for better science and statistics, including by being able to see colliders — something that we’re otherwise not able to do cognitively. This requires understanding a group of related concepts starting with d-separation (d as in directional; see also “d-separation without tears”), its opposite d-connection, and how and why to combine graphs and probabilities to draw conditional independence in a special type of causal logic drawing called Directed Acyclic Graphs (DAGs), a common heuristic form of a causal model: directed graphs with no path, or cycle, leading from a variable back to itself, i.e., directed graphs that are acyclic. You don’t need special software to do this type of drawing or graph, although it exists — DAGitty is a free one, and its website has a good list of related software, too. I’m going to stop here and recommend other resources for further learning about DAGs, because there are some really wonderful materials on this that are worth spending a little time with if you’re interested.
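As a tiny taste of what these tools let you check mechanically, here’s a sketch using the Python networkx library (this assumes a networkx version around 2.8–3.2, where the function is named d_separated; newer releases rename it to is_d_separator): on a collider DAG, X and Y are d-separated until you condition on the collider.

```python
# A tiny sketch of checking d-separation on a collider DAG with networkx
# (assumes networkx ~2.8-3.2, where the call is nx.d_separated; newer versions
# rename it to nx.is_d_separator). Node names are hypothetical.
import networkx as nx

g = nx.DiGraph([("X", "Z"), ("Y", "Z")])   # collider: X -> Z <- Y

print(nx.d_separated(g, {"X"}, {"Y"}, set()))   # True: X and Y independent with nothing conditioned on
print(nx.d_separated(g, {"X"}, {"Y"}, {"Z"}))   # False: conditioning on the collider Z connects them
```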
To be clear, there’s a vast literature on this and it’s still fairly new to me, but that’s a sign of the times in a way; there’s been a revolution in causal inferences relatively recently, and most scientists and science communicators haven’t caught up with it yet. As Stevie Wonder says, “People keep on learning.” So I’m just doing a newbie’s pass over the terrain to get an explainer out there on why, when you hear “Simpson’s paradox,” you should think — “D’OH! Is this the numeric problem of noncollapsibility, the causal problem of simple confounding, or another confound problem, like collider bias?” — and you may want to start by drawing the causal logic you think is in play as a DAG, to figure it out. Don’t be clever, as Richard McElreath warns. Tool up and take every little step visually, for we are silly monkeys.
Simpson’s Significance: A Sign of “Objective” Science that’s Not?
What does it all mean? I would argue that the term “Simpson’s paradox” is a symbol of apparently objective, neutral mathematical power that researchers have tended to inadvertently invoke in order to privilege one narrative over another without explicitly saying that scientists do this all the time — that statistics is interpretive, and we cannot get away from values and power when we collect, evaluate, and interpret evidence. This gets into philosophy of science issues concerning the fact-value distinction. There’s a long tradition of thinkers writing about this, including more recently historian of science Theodore Porter in his books The Rise of Statistical Thinking and Trust in Numbers, and Greenland in his work on statistical significance testing misuse and other common methodological mistakes that can serve powerful interests by privileging preferred narratives under the auspices of apparently objective science.
The tradition stretches back further, most famously to the 18th-century Scottish Enlightenment philosopher David Hume, who broke from Aristotelian tradition to argue, on the basis of a newly empiricist is-ought distinction, that the sorts of value conflicts that drove bloody European religious conflicts were not natural but artificial, context-dependent, and unnecessary. Immanuel Kant was inspired by Hume in his writing on analytic versus synthetic judgments. Auguste Comte, who’s sometimes credited with founding sociology and philosophy of science, founded the conceptually related philosophy of positivism, influenced by the English philosopher Francis Bacon, which in turn influenced many other philosophers’ work on the fact-value distinction, including J.S. Mill and Émile Durkheim. And around the turn of the 20th century, German sociologist Max Weber built on this idea of separation further, so it became a basic tenet of research involving human beings across the social and health sciences.
In the 20th century, there was a pendulum swing back toward questioning this fact-value dichotomy. For example, in science and technology studies, Bruno Latour led a movement to study scientists doing science the way anthropologists and sociologists study a culture and a society. Science scholar Naomi Oreskes points out in Why Trust Science? that this is consistent with Comte’s emphasis on considering the positive method in action; but, at the same time, (I think) it represented a break in practice and intellectual culture away from reifying the dichotomy. Another example is William Silverman in the evidence-based medicine movement, who noted, for instance, that in designing a clinical trial, researchers have to answer the value-laden question of what to consider an important difference in treatment outcomes — and that they usually do so without consulting patients or the public, which keys into debates about the difficulty of squaring complex science with liberal democratic values, sometimes referred to in terms of the Dewey-Lippmann debate in American philosophy and political science.
In the big picture, it would be more comfortable for us as modern secular followers of the religion of science to say that Simpson’s paradox is a purely mathematical phenomenon; but it’s not necessarily. We could call the numeric instantiation of it, noncollapsibility, purely mathematical. But usually when people refer to Simpson’s paradox, they’re talking about causal story confusions — often failing to distinguish between confounding and collider bias, when that’s an important distinction in scientific terms. So we need to recognize that values and perspective tend to influence many choices that go into collecting, analyzing, and interpreting evidence. And get comfortable with that idea, taking science off the religion model and getting real about how power tends to shape ways of seeing the world.
If this explanation is right, then there’s a beautiful irony in all this: Simpson’s main point — story drives statistics — is getting lost in dressing up science as “objective” when it’s not. Poor Simpson. Maybe we could still call that a paradox, after all.
This still leaves the question: why hasn’t the term “Simpson’s paradox” been purged from the statistical lexicon yet, if it’s so… Wrong? One relatively benign explanation for the holdover is that there are too many other, worse mistakes competing for the attention of too few upper-echelon methodologist-reformers. But it’s also possible that the term is serving a cultural function. That’s a more nefarious possible explanation that perhaps we should entertain, striking further existential terror into our hearts.
If you seize on “Simpson’s paradox” to make sense of an apparently weird statistical result, you’re seizing on, at best, an outmoded terminology that won’t get you as far as you need to go in doing science. And, at worst, a dressing-up of the naked emperor of purported objectivity in sciencey words that can cloak the centrality of values and perspective in the story-telling enterprise that is statistics. You’re saying, “I thought more and re-ran the numbers, and what ‘science says’ changed.” As if the Oracle had been mysteriously moved to alter Fate.
Summary: tl;dr - Simpson’s paradox doesn’t really exist.
Don’t Say:
Simpson’s paradox explains apparent effect disappearance or reversal when a previously omitted confound is added to a model.
Do Say:
“The Simpsons… ‘D’OH!’ ”
Not Simpson’s, not a paradox, check for noncollapsibility (especially with odds ratios, especially with matching techniques like cem where you’re creating sparse data by design), meet the causal inferences revolution, and spot the fact-value distinction breaking down on closer examination while commonly doing work for powerful interests.
Original Recommended Readings
“Confounding and Collapsibility in Causal Inference,” Sander Greenland, James M. Robins, and Judea Pearl, Statistical Science 14(1): 29–46 (1999).
“Quantifying Biases in Causal Models: Classical Confounding vs Collider-Stratification Bias,” Sander Greenland, Epidemiology 14(3): 300–306 (2003).
“Causal Diagrams,” M. Maria Glymour and Sander Greenland, Chapter 12 in Modern Epidemiology, 3rd edition.
“Quantification of collider-stratification bias and the birthweight paradox,” Brian W. Whitcomb, Enrique F. Schisterman, Neil J. Perkins, and Robert W. Platt, Paediatric and Perinatal Epidemiology 23(5): 394–402 (2009).
“The Simpson’s paradox unraveled,” Miguel A. Hernán, David Clayton, and Niels Keiding, International Journal of Epidemiology 40(3): 780–785 (2011).
“Non-collapsibility, confounding, and sparse-data bias. Part 1: The oddities of odds,” Sander Greenland, Journal of Clinical Epidemiology 138: 178–181 (2021).
“Noncollapsibility, confounding, and sparse-data bias, Part 2: What should researchers make of persistent controversies about the odds ratio?” Sander Greenland, Journal of Clinical Epidemiology 139: 264–268 (2021).