Test Iterativity, Multiple Causal Mechanisms, and Medicine
Does my Fienberg critique really have traction in healthcare?
“If you want to destroy my sweater
Hold this thread as I walk away.”
— “Undone (The Sweater Song),” Weezer
Lately, it seems I’ve been breaking my favorite toys, from Fienberg’s argument against polygraph screening at National Labs, to its logical extension in the most popular scientific argument against mass surveillance (e.g., EU’s proposed digital communications scanning program, “Chat Control”).
I like these arguments a lot, but I like to do better science even more.
So now I’m pulling the thread on another set of applications of the same argument: the Harding Center for Risk Literacy Fact Boxes, the gold standard in risk communication. I like them a lot.
Are they broken? Does diagramming the missing causal logic suggest that even excellent tools like these fact boxes might mislead in key ways?
Recap
The basic argument is that Bayes’ rule dooms mass screenings for low-prevalence problems to backfire under conditions of rarity, persistent uncertainty, and secondary screening harms. Because we’re trapped in a probabilistic universe, the accuracy-error trade-off forces what Fienberg/NAS called an unacceptable choice between too many false positives (leading to system overload, human carnage, and resource reallocation costs) and too many false negatives (which render the program effectively useless).
The problem with this argument is that it doesn’t consider causality first. Drawing the missing causal diagrams shows that test iterativity and multiple causal mechanisms complicate the analysis.
One cannot conclude these programs don’t work based on (1) a one-off application of Bayes’ rule in an iterative testing context, and (2) an accounting of only the test classification causal effects (accuracy, error) that ignores the strategic behavior effects (deterrence, evasion, sabotage) and information effects (elicitation, suggestion) that are also in the picture.
To do so would be to make a chain of logical errors linked (as such errors often are) by uncertainty aversion. It would be more accurate to say that we don’t know how these effects net out. So maybe polygraph programs actually would have helped keep spies out of National Labs, and the National Academy of Sciences accidentally undermined national security by stopping DOE from implementing the screening en masse.
It remains an important contribution, however, to warn people about the base rate fallacy and educate them on how to guard against it in assessing the net effects of programs of this structure. Similarly, per Fienberg and NAS, we have to think hard about validating mass screenings, and about what it means that such validation is not clearly possible right now in a number of contexts (e.g., tests to detect deception and misinformation, when they are used strictly as classifications and not also as tools to manipulate strategic behavior and exert information effects).
I would still like to work on extending and generalizing these elements of Fienberg’s argument. For instance, I’ve been trying for a while to build a Bayesian outcomes simulator letting people see accuracy and error spreads for programs with varying accuracy, base rate, and population inputs. (It could get complicated.)
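For concreteness, here’s a minimal sketch of the core calculation such a simulator would run. The function name and inputs below are my own illustrative assumptions, not part of any existing tool:

```python
# Minimal sketch of the simulator idea. The function name and inputs are
# illustrative assumptions, not part of any existing tool.
def screening_outcomes(population, base_rate, sensitivity, specificity):
    """Expected counts from a single pass of a mass screening."""
    affected = population * base_rate
    unaffected = population - affected
    true_positives = affected * sensitivity
    false_negatives = affected - true_positives
    false_positives = unaffected * (1 - specificity)
    true_negatives = unaffected - false_positives
    # P(actually affected | positive test) -- what Bayes' rule gives you
    ppv = true_positives / (true_positives + false_positives)
    return {
        "true_positives": round(true_positives),
        "false_positives": round(false_positives),
        "false_negatives": round(false_negatives),
        "true_negatives": round(true_negatives),
        "ppv": round(ppv, 3),
    }

# Illustrative inputs: 10,000 people screened, 1% prevalence, 90% sensitivity and specificity.
print(screening_outcomes(10_000, 0.01, 0.90, 0.90))
# ~90 true positives vs. ~990 false positives, so P(affected | positive) is only about 8%.
```

Vary the accuracy, base rate, and population inputs and you can watch the spread between true and false positives move around.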
But, right now, I just can’t help continuing to pull this thread…
Harding Center Fact Boxes on Mass Screenings for Low-Prevalence Medical Problems
There’s always a good excuse to one-up these. Especially the ones on mammography screening for early breast cancer detection, PSA testing for early prostate cancer detection, and non-invasive prenatal testing for specific genetic disorders (“NIPT”; e.g., Down’s Syndrome).
Everyone in modern, Western countries gets offered or knows people who get offered these tests. They’re considered routine preventive medicine. They’re also considered dangerous.
“Get tested!” stock public health and pop sci advice goes.
“Maybe not!” suggests the risk-benefit spread.
Mammo
Mammography, for instance, confers no established net mortality benefit. It does appear to confer a small breast cancer mortality benefit, but no proven all-cause cancer mortality benefit.
This doesn’t mean that it doesn’t save individual lives. But it does mean that we don’t know whether, on net, it costs lives or saves them. And it doesn’t appear to have a very strong effect either way, contrary to what its proponents often assume or claim.
This has been known for a long time, rests on perhaps the best empirical evidence base of any medical intervention ever, and keeps being replicated, most recently in a meta-analysis of randomized clinical trials which found that mammography may have decreased women’s lifespan by up to 190 days or increased it by up to 237 days (95% CI; Bretthauer et al, JAMA, 2024).
Reasons to suspect that mammography may actually cause a net loss of life (not to mention quality of life) include serious risks of harm from overdiagnosis, as detailed by numerous critics including H. Gilbert Welch et al, Susan Bewley, and Peter C. Gøtzsche. Those possible harms, highlighted in a recent DCIS trial’s preliminary report and preregistration description, include more radical interventions accidentally spreading cancer (making it more invasive) and very high rates of chronic pain at surgical sites (25-68%) at 4 and 9 months after breast surgery (Hwang et al, COMET Trial preregistration description).
Net benefit should arguably be demonstrated according to accepted scientific evidentiary standards before mass screening programs are implemented. And screenings like mammography that still can’t demonstrate such benefit after decades of extremely well-funded and designed research should arguably be stopped.
But perverse incentives seem to keep the programs going; everyone wants to do something to fight cancer, and no one wants to be seen defunding the fight. So all we can do is tell our friends this seems to be the state of the evidence, and ask them to ask themselves…
If you don’t know, why should you go?
The Harding Center Fact Box gives women the tools to ask and answer that question themselves. There are academic debates about whether people have a right not to choose this sort of thing. Isn’t it kind of risky, and a waste of their precious time and mental energy, if we know there’s no established net mortality benefit?
Sure, but politics seem to make it too hard to gut the programs. Too many people are too afraid of breast cancer to give up the best mass screening tool we have against it, even if that tool is uncomfortable and might actually backfire. Sorry, people are irrational. Just don’t get your own boobs smushed if you don’t want to, and tell your friends it might not be such a great idea.
PSA testing
Similarly, the same recent meta-analysis found that prostate-specific antigen testing may have decreased lifespan by up to 37 days, or increased it by up to 73 days (95% CI; Bretthauer et al 2024). The quality-of-life costs of prostate cancer overdiagnosis can also be very heavy, including not just risk of infection but urinary incontinence and impotence. This thing can wreck marriages. It has no proven net mortality benefit. To a lot of people who know the numbers, this will not make sense.
Non-invasive prenatal testing (NIPT)
Same principle, different population — but now the endpoints may include abortion and its associated possible risks to women’s health as well as to future pregnancies/kids (e.g., preterm birth, placenta previa). No one is following the causal chain out that far, likely because doing so would raise sociopolitically sensitive questions.
In particular, there is a sea of abortion misinformation to counter (e.g., providers tell women it’s net risk-mitigating and free of possible harm, but that’s not established). No one wants to counter it, because doing so gets coded as an attack on abortion access. But there are multiple more steps in the chain to follow out to do a full cost-benefit analysis here, and no one seems to have done it.
This is the only Fact Box I’ve seen that is already broken on its own terms. Unless we think women only do NIPT to know whether their kid may have Down’s or whatever, but not to consider elective abortion if it does. Denmark has good data on this (as the Scandinavians usually do): high uptake of NIPT has led to a near-disappearance of Down’s syndrome kids there, because the vast majority (over 95%) of Danish women carrying a (possibly) Down’s fetus apparently choose to abort. There is, however, a minority who do the test purely for informational purposes, with some women choosing to carry affected pregnancies to term.
Ok, not everyone is Danish. Still, it seems important to estimate the possible risks of abortion when we talk about the possible risks of the test.
The Origins of These Models’ Brilliance
Because I am a nerd, I wish Fienberg, Gigerenzer, et al would show their work in footnotes so that everyone would know the logic and genealogy of these things at first glance. The structure of the analyses comes from applying Bayes’ rule, a basic rule of probability theory that tells us how to update our priors based on new information. We should especially think of Bayes when we’re thinking about weirdness (e.g., looking for spies or cancers), because it tells us how likely the people flagged as weird are to actually be weird, mathematically speaking.
Applying Bayes’ rule one time to produce estimated outcomes, as Fienberg did in the NAS polygraph report and Harding does in the Fact Boxes that I like so much, is standard. It helps people correct for the base rate fallacy, and see that the common (false positives) overwhelms the rare (true positives) in mass screenings for low-prevalence problems.
Putting the outcomes in frequency-format tables, as Fienberg and Harding do, helps people correct for that bias best, letting them form better Bayesian statistical intuitions without training. This hack comes from Gigerenzer & Hoffrage 1995.
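As a toy illustration of what the frequency format buys you (my numbers, not Harding’s), here is the same arithmetic as in the simulator sketch above, phrased in whole people rather than conditional probabilities:

```python
# Toy illustration of the natural-frequency framing from Gigerenzer & Hoffrage 1995.
# The numbers are mine, for illustration only; they are not from any Harding fact box.
population = 1_000
base_rate = 0.01          # 10 in 1,000 have the condition
sensitivity = 0.90        # 9 of those 10 test positive
false_alarm_rate = 0.10   # 99 of the other 990 also test positive

affected = round(population * base_rate)
hits = round(affected * sensitivity)
false_alarms = round((population - affected) * false_alarm_rate)

print(f"Out of {population} people screened, {hits + false_alarms} test positive.")
print(f"Only {hits} of those positives actually have the condition.")
```

Saying “about 9 of the 108 people who test positive actually have the condition” tends to land better than “the positive predictive value is roughly 8%,” which is the whole trick.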
These Are Very Nice Models; But Are They Broken?
Yes. They are broken on a continuum of brokenness.
Mammo
The least broken is the mammography model. It’s broken because it needs to consider test iterativity. Especially if we are worried about possible iatrogenic effects of the procedure itself (e.g., from compression and radiation) and from follow-up procedures (e.g., needle biopsy or surgery), we need to see estimates of how these risks add up when the test is repeated over the years. This is especially logically important because the possible benefit doesn’t compound (a “clear” mammo at Time 1 doesn’t carry over to Time 2), but the risk of possible harm may (e.g., radiation exposure is cumulative).
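A back-of-the-envelope sketch of what I mean by compounding, using made-up per-round rates (these are placeholder numbers, not real mammography statistics):

```python
# Back-of-the-envelope compounding, with made-up per-round rates (not real mammography data).
rounds = 10                          # e.g., roughly biennial screening over ~20 years
p_false_alarm_per_round = 0.07       # assumed chance of a false positive in any single round
dose_per_round_mSv = 0.4             # assumed effective radiation dose per screening round

# A "clear" result doesn't carry over between rounds, but some exposures do accumulate.
p_at_least_one_false_alarm = 1 - (1 - p_false_alarm_per_round) ** rounds
cumulative_dose_mSv = rounds * dose_per_round_mSv

print(f"Chance of at least one false alarm over {rounds} rounds: {p_at_least_one_false_alarm:.0%}")
print(f"Cumulative radiation dose over {rounds} rounds: {cumulative_dose_mSv:.1f} mSv")
```

The point isn’t the particular numbers; it’s the shape of the math: false-alarm risk and dose accumulate across rounds, while a clear result in one round buys nothing in the next.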
One possible counter-argument is that available outcome estimates already factor this all in by looking at evidence on net mortality outcomes over several years. That ignores causality. It might matter to an individual woman making a decision about whether to go for another mammo or not to know that the risk of possible harm compounds when the testing is repeated. So if the point of the model is to help that woman, then the model could be improved by being redone to consider causality first.
Another possible counter-argument is that the compounding possible iatrogenic harm from radiation alone is generally estimated as being relatively small. Still, if this is about informed consent, then we want patients to grok the risk before undergoing the procedure.
PSA
Moving on, the second-most broken is the PSA iteration. It’s broken because it needs to consider test iterativity along with the qualitative question of how people use the test information. PSA testing by itself is zero-risk (unless you really hate needle sticks and count that). It’s the follow-up prostate biopsy that you have to worry about possibly landing you in diapers and divorce court.
So, what if you use the test for informational purposes instead (h/t my stepdad)? What if you do a routine screening to have a baseline? That way, you know if it jumps, and when it jumped. You know then to discuss with your doctor whether there could be other possible causes to investigate (e.g., UTI). And, if other causes don’t pan out, you can consider your then-current age and health status when deciding whether you want to undergo further prostate cancer screening that might entail risks you don’t want to take at that stage. Prostate cancer often grows slowly, so some men might not want to take any risks diagnosing/treating it, depending on their age and health status. But younger and/or healthier men might choose further testing if they spotted a PSA spike on routine screening.
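A crude sketch of that baseline-first, escalate-only-on-a-jump logic (the threshold below is an illustrative placeholder, not clinical guidance):

```python
# Placeholder logic only: the jump threshold below is an illustrative assumption,
# not clinical guidance.
def flag_psa_jumps(readings, jump_threshold=1.0):
    """Given (age, PSA ng/mL) readings over time, flag jumps worth discussing with a doctor."""
    flags = []
    for (prev_age, prev_psa), (age, psa) in zip(readings, readings[1:]):
        if psa - prev_psa >= jump_threshold:
            flags.append({"age": age, "from": prev_psa, "to": psa})
    return flags

history = [(55, 1.1), (57, 1.2), (59, 2.6)]   # hypothetical baseline plus later readings
print(flag_psa_jumps(history))                # flags the jump at 59, not the absolute level
```

The decision trigger here is the change from your own baseline, not a single absolute reading.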
This insight may require a different presentation form than the current fact box takes. On the basis of the fact box’s framing of the information, it looks to me like the risk-benefit balance tips clearly toward risk. But if the test is used in an informational, iterative way instead, then the balance flips the other way.
This problem has shades of Wittgenstein’s rabbit-duck drawing, and Greenland’s application of its implications in the interpretation of regression curves and other statistical evidence. The problem is that someone has to interpret the evidence, and the fact boxes make it seem like the reader gets to do that — but this presentation of the evidence has already done a great deal of interpretation to begin with.
NIPT
The most broken of my favorite Harding fact boxes is the NIPT version. This may be surprising because, here, we don’t have to worry about test iterativity: it’s one pregnancy, one NIPT. But, as discussed above, the endpoints need to be followed out farther to account for possible risks of abortion if the test (and/or follow-up, invasive testing) is positive. This looks like an error of sociopolitical circumstance.
What Matters and Who Cares?
On one hand, considering causality first and running statistical analyses later would improve these sorts of models as a matter of science. That should be done for its own sake.
In the case of mammography, it would deepen informed consent to let women know that the test may incur cumulative risks. In the case of PSA testing, it might change how men and their doctors choose to use the test, to consider that there are different approaches to this — and one is much less risky than the other. And in the case of NIPT, it might change how women weigh the possible risks if they had more and better information about the uncertainty surrounding possible abortion harms to their health, and that of future pregnancies.
On the other hand, improving these models to catch up with the causal revolution might not actually change the net cost-benefit estimates much. In the polygraph and mass surveillance cases, we can clearly say that critics have broadly failed to consider the relevance of equilibrium and information effects. We should expect these effects to matter, and should insist on their inclusion in relevant analyses.
But, in the medical context, there’s not an obvious story for why we should expect the additional identified causal mechanisms (beyond test classification) to matter in the big picture. We don’t expect mammography to deter breast cancer, or for patients to confess they have a lump when the radiographer asks them. (Though, if you have symptoms, you should 100% report them to your doctor. Targeted screening is a totally different context than mass screening. It probably makes sense to get it done fast.)
The issue of cumulative radiation risk in mammography is practically small in terms of effect size, and gets into subgroup tailoring (women with larger breasts are at higher risk). The issue of iterativity and interpretation in PSA testing is really a qualitative one of good communication and individually tailored risk assessment. The issue of missing endpoints in NIPT is really a sociopolitical one where no one wants to touch abortion misinformation. Arguably, none of these problems is about an entirely missing level of causal logic that we would expect to change most people’s net cost-benefit analysis, though they are all problems related to missing causal logic.
So it looks like mass screenings for low-prevalence problems in issue areas that deal with rule-abidingness (security, education, publishing) are probably worst-affected by prevailing models’ missing causal modeling. The best available analyses of programs of this structure could all be improved by incorporating this relatively recent methodological advance. But this suggests there is a great deal of remodeling work to be done.
Meanwhile, the apparent irrelevance of equilibria and information effects to the medical case studies here (if correct) suggests that such improvements may be easiest in medicine and most important in security. In other words, the models dealing with programs that are arguably the highest-stakes for society also seem to require the most extensive and computationally complex revisions.