Multiverse Policy Analysis
What might it look like to make uncertainty more explicit in estimating program costs and benefits?
Policy analyses often obscure uncertainty; but what if we made it explicit? This post explores how multiverse analysis might change standard practice, using the polygraph debate as an example.
One of the realizations that has been blowing my mind lately is how much Stephen Fienberg’s application of Bayes’ rule, which ostensibly showed why polygraph programs at National Labs would backfire, was a rhetorical ploy rather than an absolute mathematical or empirical reality (see recent posts 1, 2, 3, 4, and 5).
On one hand, that’s not unique to this model. The truism is true that “all models are wrong, but some are useful.”
On the other hand, in hindsight it’s easy to see his mistakes:
Missing causal diagramming
He didn’t do causal modeling to factor in how “lie detection” could work as (1) a test, (2) a deterrent, and (3) a bogus pipeline (people confess because they think the test might be partly real). He modeled only the test, invoked the accuracy-error trade-off, and concluded that both strict and relaxed modes are dangerous. That conclusion assumes we know the net effect of three causal mechanisms from modeling only one. We do not.
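To make the test-only model concrete, here’s a minimal sketch in Python of what that single mechanism looks like under Bayes’ rule. The prevalence, sensitivity, and specificity numbers are purely illustrative placeholders of mine, not Fienberg’s or the National Academy’s:

```python
# Minimal sketch of the test-only model: Bayes' rule with a low base rate.
# All numbers are illustrative, not taken from Fienberg or the NAS report.

def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(spy | test flags someone) via Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

prevalence = 0.001  # suppose 1 in 1,000 employees is a spy

# "Strict" mode flags aggressively; "relaxed" mode flags rarely.
for label, sens, spec in [("strict", 0.90, 0.80), ("relaxed", 0.50, 0.995)]:
    ppv = positive_predictive_value(prevalence, sens, spec)
    flagged = prevalence * sens + (1 - prevalence) * (1 - spec)
    print(f"{label}: ~{flagged:.1%} of employees flagged, "
          f"~{ppv:.1%} of those flagged are actual spies")
```

With numbers in roughly this ballpark, strict mode flags a large share of innocent employees and relaxed mode misses many spies. That is the backfire argument, but only for the one mechanism being modeled.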
One-off versus iterative Bayes
He also didn’t factor in how the test could be iterative and could even occur in the context of a known espionage problem, so we might want to treat this as more analogous to a Bayesian search problem than to applying Bayes’ rule once and being done with it.
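Here is an equally toy sketch of what iterating the update might look like, again with made-up sensitivity and specificity and the strong (and probably wrong) simplification that repeated administrations are independent:

```python
# Toy sketch of iterative updating: repeated screenings of one employee,
# with the strong simplification that administrations are independent.
# Sensitivity and specificity are again purely illustrative.

def update(prior, result, sensitivity=0.80, specificity=0.90):
    """One Bayes update of P(spy) given a 'fail' or 'pass' result."""
    if result == "fail":
        like_spy, like_innocent = sensitivity, 1 - specificity
    else:
        like_spy, like_innocent = 1 - sensitivity, specificity
    numerator = prior * like_spy
    return numerator / (numerator + (1 - prior) * like_innocent)

p = 0.001  # prior probability this employee is a spy
for screening, result in enumerate(["fail", "fail", "fail"], start=1):
    p = update(p, result)
    print(f"after screening {screening} ({result}): P(spy) = {p:.3f}")
```

A single failed test barely moves a low prior, but repeated failures, or a prior already raised by a known espionage problem, can move it substantially, which is why the one-shot framing may understate what an iterative program does.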
The funny part of these mistakes is that they occurred in the context of Fienberg and the National Academy of Sciences saying, essentially, “We’re making all these assumptions that are generous to polygraph program proponents, and still the math says the programs backfire.” This framing subtly misleads, even though the math is (of course) correct and the cognitive science insights it illustrates are important.
Assumptions that were actually generous to proponents would have accounted for the other widely acknowledged causal mechanisms, and for the iterative nature of the testing, both within polygraph programs and as part of larger investigative or administrative processes (with actors updating priors repeatedly in both contexts).
This realization led me to rethink how uncertainty is handled more broadly in policy modeling.
Simple Is Good, but Right Is Better
So the seams of an analysis I have long admired are now showing, like the switch under the magician’s table. It’s a funny feeling.
For one thing, I had wanted to neatly reapply this analysis to other case studies of programs with the same mathematical structure: mass screenings for low-prevalence problems, like Chat Control, mammography for early breast cancer detection, and PSA testing for prostate cancer. It seemed like a good way to help make society better through science. I had policy goals in mind that I assumed this kind of analysis would serve, because I thought I knew the answers.
But it doesn’t look like such a simple story to me anymore. It looks like one part of what should be a more complex policy multiverse analysis. And I’m wondering why we never see these instead of one-dimensional cost-benefit analyses. Or do we? Are they not worth doing? Are they done? (Please email me examples.)
Reappropriating the Multiverse Analysis
We need to follow Fienberg’s lead in applying Bayes’ rule to programs that share this common structure, in order to correct for base rate bias. Policymakers and ordinary people alike can use statisticians’ and cognitive scientists’ counsel on that point to help guard against possible net societal harms from this common type of program.
But it’s not black-and-white that correcting for base rate bias means programs of this structure necessarily backfire. Not even, as I had thought, under conditions of rarity, uncertainty, and secondary screening harms. That was probably too simplistic an attempt to systematize Fienberg’s work.
Systematizing means we have to theoretically and empirically consider where it generalizes and where it doesn’t. To do that, we have to take a page (or a bookshelf) from metascience and cognitive science on uncertainty aversion, and model uncertainty more.
It is not enough to just warn people that models are wrong. Arguably, we have to actually attempt to show them how wrong they could be.
As Judea Pearl wrote in Causality: Models, Reasoning and Inference on propensity score matching and other statistical methods:
it is not enough to warn people against dangers they cannot recognize; to protect them from perilous adventures, we must also give them eyeglasses to spot the threats and a meaningful language to reason about them. By failing to equip readers with tools (e.g., graphs) for recognizing how “strong ignorability” can be violated or achieved, they have encouraged a generation of researchers (including federal agencies) to assume that ignorability either holds in most cases, or can be made to hold by clever designs (p. 352).
Graphs are one important tool here. But in Feyerabend’s spirit of methodological pluralism (and mindful of Greenland’s warning that these limits apply to DAGs too), I suggest we also think about how to show uncertainty better even in the simplest policy cost-benefit analyses. This might mean, instead of one model estimating program costs and benefits, showing several in a policy multiverse analysis (if it’s not inappropriate to reappropriate Steegen et al.’s term here).
This could help address science’s neutrality problem (which is not solvable) by quantifying the implications of different foundational assumptions on deeply contested terrain. Surely there is already a name for this and a literature doing it. But I don’t see it in hyperpolarized discourses like those dealing with abortion, Covid, mass surveillance, and other subjects where one sees both sides characterize the other as engaging in misinformation (and, to be sure, sometimes both sides do).
The idea is that, instead of one cost-benefit analysis, we would expect to see a set of them, starting with one model based on assumptions generous to policy proponents and one based on assumptions generous to critics. But this would tend to blow up along multiple axes.
For instance, in the context of mass screenings for low-prevalence problems, we may have to worry about multiple possible causal pathways, not just for security screenings but also for medical tests, which may bring people in for other needed medical care. We may have to worry not just about accuracy estimates inflated by perversely incentivized researchers and error rates deflated by dedicated attackers gaming the system, but also about deterrence.
And we may have to ask to what extent individuals (from experts to patients) or organizations really use these sorts of screenings as one-off tests versus iterative ones, and to what extent that itself may be heterogeneous within groups of actors like administrators or patients. Taking a page from Gigerenzer, maybe people making important decisions in the real world are often not as dumb as researchers tend to think they are; maybe they tend to account for biases just fine somehow, thank-you-very-much.
Varying all of these assumptions within reasonable ranges may produce such substantial variation in estimated hypothetical program effects that even the simplest cost-benefit analysis yields much more uncertainty in net effects than is generally expected or desired. On one hand, this would underscore the need for better data collection to inform evidence-based policy in the real world. On the other hand, it would not serve anyone’s sociopolitical agenda or narrative, and so it might be less likely to get done even though it seems more likely to be right.
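To gesture at what even a crude version of this might look like, here is a toy policy multiverse in Python. Every parameter range and the net-benefit bookkeeping below are invented for illustration; a real analysis would ground each axis in the relevant literature and in the actual disagreements between proponents and critics:

```python
# Toy policy multiverse: one crude net-benefit estimate per combination of
# contestable assumptions. Every value below is hypothetical and only meant
# to show how wide the spread of estimates can get.
from itertools import product

population = 10_000

prevalences   = [0.0001, 0.001, 0.01]    # how rare is the problem?
sensitivities = [0.5, 0.8, 0.95]         # contested accuracy claims
specificities = [0.8, 0.9, 0.99]
deterred      = [0.0, 0.25, 0.5]         # fraction deterred or confessing
harm_per_fp   = [1, 10, 100]             # cost of each false alarm
value_per_tp  = [100, 1_000, 10_000]     # value of each catch or deterrence

results = []
for prev, sens, spec, det, harm, value in product(
        prevalences, sensitivities, specificities,
        deterred, harm_per_fp, value_per_tp):
    bad_actors = population * prev
    caught = bad_actors * (det + (1 - det) * sens)      # deterred + detected
    false_pos = (population - bad_actors) * (1 - spec)  # innocents flagged
    results.append(caught * value - false_pos * harm)

print(f"{len(results)} model specifications")
print(f"net-benefit range: {min(results):,.0f} to {max(results):,.0f}")
print(f"share with net benefit > 0: {sum(r > 0 for r in results) / len(results):.0%}")
```

With these made-up ranges, the specifications span from clearly net-harmful to clearly net-beneficial, which is the point: the headline conclusion is driven by contestable assumptions, and an honest presentation would show that spread rather than a single number.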
Fear and Excitement
I’ve always been surprised and a bit scared by my scientific research results. To the horror of friends in industry, who habitually advised me not to do any experiment for which I didn’t already know the answer, and of successful academics and statistical consultants who normalized spinning the unexpected as expected and the uncertain as certain, I never know what I’m going to find, and my latest research adventures are no exception. Granted, it looks like it will take a lot of work to find anything out this way. Maybe both of those things mean it’s good science?
Yet, there may not be much of a market for it. Who wants to pay me to perform extensive theoretical and empirical research to possibly conclude we don’t know the effects of a bunch of programs for which we would like to know the effects? (I accept Stripe and dark chocolate.)
It should also be said that none of this is really new, even though it still feels new to me somehow. This is just a remix of Greenland on data not talking and Hossenfelder on science not saying anything. But I still suspect there are open empirical questions, which scientists can answer, about the extent of this interpretive bind in various contexts.