Science Fiction
Bad abortion-suicide research turns risk upside-down, but will the authors retract?
A spectre is haunting science — the spectre of publication. It haunts science past, science present, and science future. It’s hard to correct the publication record (the ghost of science past). So mistakes generally live on — nonreplicable publications get cited more than replicable ones, and citations continue even after that rarest of corrections, retraction. Doing better science today (the ghost of science present) is hard, because “We need less research, better research, and research done for the right reasons,” in Doug Altman’s phrase; but universities and corporations pay for publication quantity and even content — not for quality. The ghost of science future is only the aspiration to keep learning that should haunt us all.
Methodology criticism can be misunderstood as nit-picking, or as nerds scolding people who so far know less than we do, when we’re all human beings on this journey. It’s important to do it anyway, particularly when bad science endangers vulnerable subjects. So reformers have to emphasize the human and ethical elements in these stories. Those first principles come first. That’s also how we might shift norms within the incentive structure that has so far made correcting the scientific record the critic’s adversarial and infrequently successful role, working alone against the grain of misconception, perverse incentives, and inertia.
Researchers who publish bad science should clean up their own messes for the simple reason that hurting people is wrong, and fiction masquerading as science has the potential to hurt very real people. Being sorry for making mistakes (no bad intent assumed) and wanting to prevent possible harm should motivate voluntary retractions in egregious (but not uncommon) cases where authors have gotten science demonstrably wrong, and letting the publication record stand may contribute to preventable suffering, harm, and even deaths. This would address (but not vanquish) the ghosts of science past and science present.
Here’s a test case…
Abortion Restrictions and Women’s Mental Health
Abortion is associated with around a 2x increased suicide risk (causality unknown), as I’ve written previously. Zandberg et al’s recent JAMA Psychiatry article “Association Between State-Level Access to Reproductive Care and Suicide Rates Among Women of Reproductive Age in the United States” (Dec. 28, 2022) ignores a wide array of evidence on this reliable and practically significant effect. It presents results from an analysis that the authors claim show a substantial increase in suicides among reproductive-age young women in states that restricted abortion access. This spins a massive abortion-suicide link into what looks like a practically significant abortion restriction-suicide link — its functional opposite — without evidence. If it remains in the publication record, it may contribute to widespread misinformation about possible abortion risks, including prominent abortion providers wrongly telling pregnant women that abortion carries no relative risks, when it may substantially increase suicide risks.
There are many problems with Zandberg et al’s analysis, but this post focuses only on the one calculation at its heart. On p. E5, Zandberg et al say that they took the nonnormalized beta point estimate of their state abortion restriction variable (the creatively named “TRAP law index”), .32, and divided it by the average suicide rate per 100,000 women in states without abortion restriction law enforcement, 5.5. The relevant sub-sample of women aged 20-34 numbered only 1,022. So the expected number of suicides in this group was essentially zero: a rate of 5.5 per 100,000 works out to about 0.056 expected suicides among 1,022 women. In other words, the 5.81% increase the authors claim their analysis found is an increase on zero, and an increase of 5.81% on zero is zero.
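To make that arithmetic concrete, here is the back-of-the-envelope version as a few lines of code (my own reproduction from the figures quoted above, not the authors’ code):

```python
# Reproducing the headline arithmetic from the figures quoted above.
beta_nonnormalized = 0.32   # TRAP-law coefficient reported on p. E5
baseline_rate = 5.5         # suicides per 100,000 women in comparison states
subsample = 1022            # women aged 20-34 in the relevant sub-sample

relative_increase = beta_nonnormalized / baseline_rate    # the headline percentage
expected_suicides = subsample * baseline_rate / 100_000   # expected suicide count

print(f"claimed relative increase: {relative_increase:.2%}")   # ~5.8%, in line with the 5.81% claim
print(f"expected suicides among {subsample} women: {expected_suicides:.3f}")  # ~0.056, effectively zero
```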
Ok, measuring changes in rare but serious outcomes is hard and still worth trying. But this calculation also builds on dubious analytical choices that inflate the estimate in several ways. The first is selective reporting of statistically significant results and non-reporting of non-significant ones, also known as p-hacking. This is an unacceptable practice that the authors should not have engaged in, and that the journal’s reviewers and editors should have noticed and stopped.
Panel A of eTable 5 in the supplement shows subgroup analyses for ages 20-24 (.052 to .26, p = .004), 25-34 (-.015 to .30, p = .074), and 35-44 (-.11 to .23, p = .481). By contrast, the article’s Table 1 omits the 35-44 age category entirely, jumping from ages 20-34 in one column to ages 45-64 in the next. The authors then repeatedly refer to the 20-34 category as reproductive-age women. The 35-44 category should also have been included, and the article gives no reason for, or acknowledgment of, its exclusion.
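As a rough consistency check, the reported p-values can be recovered from those interval bounds, assuming they are symmetric 95% Wald intervals and a normal sampling distribution (my back-of-the-envelope calculation, not the authors’ code):

```python
# Approximate p-values implied by the 95% CIs reported in eTable 5, Panel A.
from statistics import NormalDist

subgroups = {              # age range: (CI lower, CI upper) as reported
    "20-24": (0.052, 0.26),
    "25-34": (-0.015, 0.30),
    "35-44": (-0.11, 0.23),
}

for ages, (lo, hi) in subgroups.items():
    point = (lo + hi) / 2            # midpoint as the implied point estimate
    se = (hi - lo) / (2 * 1.96)      # standard error implied by a 95% interval
    z = point / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
    print(f"{ages}: estimate ~ {point:.3f}, p ~ {p:.3f}")

# The output lands close to the reported p = .004, .074, and .481:
# only the youngest subgroup clears conventional statistical significance.
```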
This omission from Table 1 (not the only one; see also eTable 5, Panel B) may be especially consequential because, as discussed previously, age seems to play a role in the abortion-suicide phenomenon. For instance, Luo et al 2018 and Gissler et al 1996 (Figure 1) found worse suicidal ideation and more suicides, respectively, among older women who had abortions. There are plausible explanations for these effects, including the post-abortion return to non-pregnant progesterone (“Nature’s Valium”) levels, which decline with age.
So Zandberg et al’s supplement shows they had suicide data for older reproductive-age women, but omitted it from their results and interpretation without explanation. Those data returned non-significant results in their analysis. This contradicts their statement in the abstract that “Findings remained significant when using alternative, broader indices of reproductive care access and different age categorizations.” Maybe it was an honest mistake. It’s still p-hacking.
The second source of apparent inflation is the use of the non-normalized figure for the main calculation. That choice is defensible in itself, because standardized regression coefficients can be misleading, but it is a possible source of confusion: the article reports normalized figures in Table 1 and then switches to the non-normalized figure for the headline calculation without giving any reason, and the non-normalized point estimate (.32) happens to coincide with the upper bound of the normalized estimate’s interval. The switch should have been justified, and reporting non-normalized estimates in Table 1 as well would have avoided the confusion altogether.
Finally, like all the other methodologists on the face of the planet, I’m still screaming that scientists need to please report and interpret full confidence (or compatibility) intervals in light of practical significance, instead of reporting point estimates that attained statistical significance. Zandberg et al’s article is one more example of statistical significance testing misuse that risks hyping bogus claims. One facet of this misuse is that the authors inflate their findings by reporting and interpreting the point estimate instead of the full compatibility interval.
The 95% CI for the abortion restriction variable reported in Table 1 is (.03-.32). This borders on statistical insignificance; the lower bound is very close to zero. Interpreting the full interval, in accordance with best practices, would highlight that fact. This suggests it’s unclear from their analysis that there is an effect.
To be precise, Zandberg et al report interpreting their nonnormalized point estimate of .32 by dividing it by the 5.5 suicide rate to yield the claimed 5.81% increase associated with abortion restriction laws. Bracketing all other aspects of their procedure and interpretation, the same calculation can be run on interval bounds instead of the point estimate: divide each bound by 5.5 (equivalently, add it to 5.5 and take the ratio over 5.5). The normalized Table 1 interval (.03-.32) then translates to a relative increase of roughly 0.5% to 5.8%, and the reported nonnormalized interval (.06-.58) to roughly 1% to 10.5%. Either way, the low end of the range the data are compatible with is practically negligible, and the single 5.81% figure the article reports as its main finding conceals that.
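For transparency, here is that conversion spelled out (again, my own arithmetic on the figures quoted above, not the authors’ code):

```python
# Convert interval bounds to the article's "percent increase over baseline" scale.
baseline_rate = 5.5   # suicides per 100,000 women in comparison states

intervals = {
    "normalized (Table 1)": (0.03, 0.32),
    "nonnormalized":        (0.06, 0.58),
}

for label, (lo, hi) in intervals.items():
    # relative increase = bound / baseline, i.e. (baseline + bound) / baseline - 1
    print(f"{label}: {lo / baseline_rate:.1%} to {hi / baseline_rate:.1%}")

# -> normalized: about 0.5% to 5.8%; nonnormalized: about 1.1% to 10.5%
```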
Through p-hacking and statistical significance testing misuse, this article spun an invalid, inflated, and possibly practically meaningless result into what might be misinterpreted as evidence of a substantial link between abortion restrictions and suicide. (And was.) In contrast, a wide array of evidence that the article ignores shows a massive association between abortion and suicide. Now what?
Clean Up Your Own Mess: Voluntary Retraction
This matters because it’s not unusual. Much of the medical and scientific literature is this bad. And the publication record seems essentially un-correctable. The window for letters closed four weeks after the article’s publication date; so much is published, so fast, that most mistakes are bound to be left standing in this typical window.
Journals don’t retract articles anyway, except in very exceptional circumstances. If Goodman and Greenland couldn’t get their critique of J.P.A. Ioannidis’s infamous 2005 article published in PLoS, what hope do the rest of us have? My efforts to address a truly abysmal meta-analysis that I think puts newborns at risk of death or disability convinced me that trying to right this type of wrong through formal channels is a waste of time. (I folded a much shortened critique into my own article, instead.) Who wants to spend their life playing a losing game of pseudoscience Whack-A-Mole?
A published critique should be more comprehensive, anyway — and thus more time-consuming for the critic. Here, it should discuss why it’s not ok to look at female suicides stratified by age but not by reproductive status, when pregnancy appears hugely protective, and pregnancy loss — including abortion — appears quite risky. If what we really care about is the causal effect of abortion care on suicides, then attending to this variable is the place to start. It’s also important to consider non-collapsibility, and the sparse-data problems that come with increasingly small sample sizes in different sub-categories in this type of matching framework. And of course, we need a DAG (causal diagram). Among other things, it would tell us whether the model’s covariates are confounders, which should be adjusted for, or colliders (common effects of the exposure and the outcome, or of their causes), which introduce collider stratification bias when adjusted for. The model’s economic variables (state GDP growth and state unemployment rate) are a case in point: if economic conditions drive both abortion restriction laws (say, because political leaders use culture war tactics to win over voters under economic duress) and suicide (because people under economic duress are more likely to experience distress), they are confounders; if instead they are partly downstream of the laws and of population mental health, adjusting for them can manufacture bias.
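To make the collider worry concrete, here is a toy simulation (entirely hypothetical numbers and structure, not the authors’ model): the law variable has zero true effect on the outcome, both feed into an economic variable, and adjusting for that variable manufactures an association out of nothing.

```python
# Toy demonstration of collider stratification bias with made-up data.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

law = rng.normal(size=n)        # stand-in for a TRAP-law index
outcome = rng.normal(size=n)    # suicide-rate stand-in; by construction NOT affected by law
economy = 0.8 * law + 0.8 * outcome + rng.normal(size=n)   # common effect (collider)

def ols_coef(y, predictors):
    """OLS coefficient on the first predictor, with an intercept included."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

print("law coefficient, unadjusted:          ", round(ols_coef(outcome, [law]), 3))           # ~0.00
print("law coefficient, adjusted for economy:", round(ols_coef(outcome, [law, economy]), 3))  # clearly nonzero, ~ -0.4
```

Whether the real economic covariates behave like this depends entirely on the causal structure, which is exactly what a DAG would make explicit.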
The bottom line is that science is error-prone and not self-correcting. It’s a social activity done by human beings who make mistakes, bring perspectives, and respond to perverse incentives that select for bad science. And unpaid reviewers who perform often nominal quality control checks. And journal editors who appear to care more about saving face than saving lives, employed by publishers that care more about making money than making knowledge available. (I wanted to include screen shots of the cited article and supplement pages, but was afraid of the publisher.)
The funny thing is, at present, scientists seem to be looking to journal editors and publishers to right wrongs. As if they were some sort of moral paragon. We know they’re not.
Cut. In this scene, scientists will be playing the role of truth-seekers who police themselves when they fail to adhere to current methodological standards. When you publish something wrong, you retract it. Why not?
It’s viewed as socially unacceptable to say when you made a mistake, even though that’s part of life. We all make mistakes. What’s right, methodologically and morally, conflicts here with what’s perceived as socially acceptable. The funny thing about taboos is, most people can’t bring themselves to break them for fear of social repercussions. But observing them sometimes has even worse social consequences. These losses are just normally socialized in this context, as when critics follow norms that say the journal at fault gets to mediate what is said about the publication record. Meanwhile, the costs of contesting bad science are privatized on the people who know and care enough to contest it. And the costs of bad science itself are privatized on the ordinary people whose lives may be affected by it.
Women’s lives are not worth less than researchers’ careers, or than what journal editors or colleagues think is appropriate.
Maybe, for scientists on both sides of these discussions, it’s worth doing the experiment of doing the right thing for the right reasons and saying so. Dialogue between reasonable, well-meaning people need not be institutionally mediated. People should clean up after, rather than profiting from, their own messes. We’re all adults here. Let’s be excellent to each other.
“We need less research, better research, and research done for the right reasons.” Part of that “less research” needs to come from fewer bad studies staying uncorrected in the scientific record. The ghosts of science past are too numerous to vanquish. But at least we can say their names out loud in the light, and see if some of them disappear.