You Say Tomato, I Say Bayesian Search
A plural, contested concept. A few different ways to model iterated screening. A conceptual question about signal detection problems.
Like so many terms, “Bayesian search” can refer to a number of different literatures and acts. The question that got me interested enough in it to open a gazillion tabs (a low bar): How do we better use the information we’ve got in iterative screening contexts?
Better than the standard one-off application of Bayes’ rule we see in places like the NAS polygraph report (see Table S1, below, as usual) and Richard McElreath’s vampire test parody of that usual medical test presentation?
The problem is that we want to keep the insight that the common overwhelms the rare, accounting for the base rate bias that all too often blinds us to the false positive problem, while also incorporating the additional information we might get from iterative testing in some contexts. What contexts?
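For concreteness, here's a minimal Python sketch of that standard one-off calculation. The prevalence, sensitivity, and specificity are made-up illustration numbers, not figures from the NAS report or McElreath's vampire example:

```python
# A minimal sketch of the one-off Bayes application at low prevalence.
# All numbers below are invented for illustration.

def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(condition | positive test) via Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Rare condition, seemingly accurate test: most positives are still false.
print(positive_predictive_value(prevalence=0.001, sensitivity=0.99, specificity=0.99))
# ~0.09: roughly 9 in 100 flagged cases actually have the condition.
```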
Running the same blood test twice is dumb; we need different secondary screening tests to disambiguate true from false positives in infectious disease contexts. But running the same mass telecom surveillance over time is what mass surveillance does, and we would expect such programs to incorporate information about past test results to help disambiguate true from false positives in contexts like the proposed EU digital communications scanning program, Chat Control. That would at least be one logical way to try to identify truly abusive patterns in AI scanning for child sexual abuse material and mitigate the false positive problem. So how might this work?
My naive intuition was that it might implicate a different Bayesian statistical paradigm, like Bayesian search. Whatever that means…
This post argues the simple one-off Bayes application may not be ideal for iterated screening contexts, explores a few different possible meanings of an alternate “Bayesian search” paradigm, and suggests multilevel modeling offers a more promising alternative.
Bayesian Search, Take One: Dude, Where’s My Sub?
Originally, Bayesian search comes from a seed of an idea that Bayesian decision theorist Howard Raiffa gave Navy scientist John Craven: keep updating your probability estimates as the search proceeds, so each round of looking improves the next, in contexts like searching for missing objects (submarines, planes, scruffy bears…). This is about resource allocation under uncertainty (Bayes alert).
How did Craven succeed in locating a missing nuclear submarine, the USS Scorpion, using this method in 1968?
He started with qualitative work on causality, interviewing experts about what might have happened.
He then worked with an acoustics expert, Gordon Hamilton, who used a recording of what was likely the sound of the sub imploding from water pressure as it went down to construct a search box within which the wreck was eventually located.
Putting together expert-informed logical thinking about causality with empirical information worked.
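If you want to see the mechanics, here is a toy sketch of the textbook Bayesian search update, not Craven's actual numbers: divide the area into cells, put an expert-informed prior on each, search the most promising cell, and redistribute belief after each unsuccessful look. The grid, priors, and detection probabilities are all invented:

```python
# Toy Bayesian search: cell priors plus detection probabilities, updated after
# each failed search. All values are illustrative assumptions.

def update_after_failure(priors, detect, searched):
    """Posterior cell probabilities after searching `searched` and finding nothing."""
    p, d = priors[searched], detect[searched]
    miss = 1 - p * d  # total probability of coming up empty on this search
    posterior = [pi / miss for pi in priors]
    posterior[searched] = p * (1 - d) / miss
    return posterior

priors = [0.5, 0.3, 0.2]   # expert-informed prior: where might the sub be?
detect = [0.8, 0.6, 0.4]   # chance of spotting it if we search the right cell

probs = priors[:]
for step in range(3):
    target = max(range(len(probs)), key=lambda i: probs[i] * detect[i])
    print(f"step {step}: beliefs {[round(x, 3) for x in probs]}, search cell {target}")
    probs = update_after_failure(probs, detect, target)
```

Notice how belief drains out of a searched-but-empty cell and flows to the others; that is the whole trick.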
On one hand, mass screenings for low-prevalence problems are also about resource allocation under uncertainty. We also want to put together expert-informed logical thinking about causality with empirical information to make them better, probably, if we can. Maybe? We’re not getting rid of them, anyway. There are too many guild interests promoting them, and the false promise of techno-solutionism is too tempting, so maybe even staunch critics should agree to just try to stem the tide of their damage by hacking them to be better where possible; or at least anticipate how others could do that and then counter the attempt.
On the other hand, we’re not trying to locate objects in space in the vast majority of these programs. We’re trying to find the baddies. So it’s not clear that Craven’s Bayesian search method maps onto signal detection problems more broadly. There is no big bang from a regular Internet porn user sliding into child sexual abuse imagery (part of the problem Chat Control is meant to flag), or from an early-stage cancer seeding lymphatic tissue in the process of metastasizing (part of the problem mammography is after).
That’s the whole problem in a nutshell: we want to detect signals where there is noise. If we had a Holy Grail of the signal (like a giant bang from a missing sub imploding from water pressure on its way down), then we wouldn’t be in this mess.
So maybe there is something here that applies there, but I don’t see it. It could still hold promise in niche situations where we’re looking for one specific thing, like finding a mole in an intelligence context. But (I think) there would have to be enough data to transfer Craven’s physical probability density functions to nonphysical probability distributions. That could probably be done, but it’s probably not how anyone in intelligence does it or would want to do it. And, more importantly, I wouldn’t know how to either make a toy example of it to play with, or validate a real example of it given data access I’ll never get. (All-cause mortality for spies under polygraph and no-polygraph programs, please?)
Never one to scramble out of a good rabbit hole without at least a shiny pebble to show for it (and some dirt in my hair), I seem to have identified another group and approach with which the term is also associated…
Bayesian Search, Take Two: Cau$al Discovery Algorithm$
There is a well-known Carnegie Mellon group, program, and algorithm called Tetrad associated with Peter Spirtes, Clark Glymour, and Richard Scheines (SGS). It uses Bayesian networks to search observational data to infer causal relationships post hoc, aka causal discovery.
This is a terrible idea. It’s like data dredging had a baby with Bayes. Science reform calls out “Causality first!” and it answers “Sure, right after running a gazillion statistical analyses on this data spewing from a firehose!”
Nonetheless, it has gotten recent Department of Defense funding to the tune of over six billion dollars.
Yes, you read that right. Billion with a B. Like bacon.
Smarter people than me have already critiqued this approach. On the quiet side, there is probably a reason Stephen Fienberg didn’t even bother to mention it in his work on polygraphs despite sharing an institutional home (Carnegie Mellon) and statistics expertise/interests with these folks.
On the louder side, in “Are there algorithms that discover causal structure?” (Synthese 121 (1-2):29-54 (1999), h/t Sander Greenland), David Freedman and Paul Humphreys argue SGS-style Bayesian search is philosophically and methodologically flawed.
In a throwback to Craven’s success and what is generally known about thinking logically about causality, F&H argue:
identifying causal relations requires thoughtful, complex, unrelenting hard work; substantive scientific knowledge plays a crucial role. Claims to have automated that process require searching examination; indeed, the principal ideas behind automated causal inference programs are hidden by layers of formal technique (p. 29).
They also claim “there are no real examples where the algorithms succeed” (p. 30), “causation is defined in terms of causation, with little value added” (p. 32), mistakes statistical tests make compound as multiple tests are made (p. 33), “the SGS algorithms must depend quite sensitively on the data and even on the underlying distribution: tiny changes in the circumstances of the problem have big impacts on causal inferences” (p. 33), and they don’t adequately empirically test their algorithms’ success (p. 34) — “If the algorithms work, they work despite failures in assumptions – and if they do not work, that is because of failures in assumption” (p. 40).
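A back-of-the-envelope illustration of the "mistakes compound" point: constraint-based causal discovery runs many conditional independence tests, and even a small per-test error rate piles up fast. The per-test error rate and test counts here are my own illustrative assumptions (and the calculation treats test errors as independent), not figures from Tetrad or from Freedman and Humphreys:

```python
# Rough sketch of error compounding across many independence tests.
# alpha and the test counts are invented; errors are assumed independent.

alpha = 0.05  # assumed per-test probability of a wrong independence call
for n_tests in (10, 50, 200, 1000):
    p_any_error = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:>5} tests -> P(at least one wrong call) = {p_any_error:.3f}")
```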
So if you were hoping AI/ML would save you from having to talk to people to figure out what’s actually going on, newsflash: it’s not going to happen.
On one hand, I am always happy when I can close one of my embarrassingly many tab colonies without further excavation. I need help aiming the buckshot of my attention span, and if it happens to come in the form of cynical rejection of the usual algorithmic hype amid perverse incentives, so be it.
On the other hand, I am still trying to answer the question of how iterative screenings can better incorporate accumulating prior information from previous screenings, and this doesn’t look like a good rabbit hole to keep exploring. Or rather, it seems potentially well-funded but methodologically poor.
There is anyway a simpler, better-founded path to travel here… (Presumably one among hundreds or more.)
Multilevel Modeling: Now with 200% More Bayes!
Multilevel modeling does what I wish more screening models did: nest layers of uncertainty to better represent repeated or clustered information. So instead of treating each test iteration as isolated, multilevel models let us shrink noisy estimates toward more stable group or prior estimates. This reduces estimation error not by ignoring variability, but by pooling strength across levels.
Following IJ Good, Greenland’s classic 2000 paper introduces this approach as a conceptual and practical unification of frequentist and Bayesian traditions. Greenland defines multilevel modeling (aka hierarchical regression) as a generalization of regression that “distinguish[es] multiple levels of information in a model,” allowing better estimates by borrowing strength across levels (“Principles of multilevel modelling,” Int J Epidemiol 2000 Feb;29(1):158-67, at 158).
If what we’re after is wiser use of prior information, this looks promising: “Multilevel analysis allows one to use prior information in a manner more cautious than ordinary Bayesian analysis and more thorough than ordinary frequentist analysis” (p. 165). This judicious use of information pays off in lower expected squared error (ESE): estimates are closer to the truth on average when we shrink them toward a plausible prior, trading away a lot of random variation at the risk of adding a bit of bias (p. 159). (So it matters how good our prior information is; good info improves accuracy, bad info degrades it.) In short, better estimation through smart shrinkage.
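Here is a small simulation in that spirit, with arbitrary true rates, group sizes, and a fixed halfway shrinkage weight of my own choosing (nothing from the paper): pulling noisy site-level estimates partway toward the pooled mean usually cuts total squared error, exactly the bias-for-variance trade described above.

```python
# Toy demonstration that shrinking noisy group estimates toward a pooled mean
# tends to lower total squared error. All settings are illustrative assumptions.

import random
random.seed(1)

true_rates = [random.uniform(0.01, 0.05) for _ in range(50)]   # 50 screening sites
n_per_site = 200

# one noisy round of screening counts per site
counts = [sum(random.random() < r for _ in range(n_per_site)) for r in true_rates]
raw = [c / n_per_site for c in counts]
overall = sum(raw) / len(raw)

def total_squared_error(weight):
    """Squared error of estimates shrunk `weight` of the way toward the overall mean."""
    shrunk = [weight * overall + (1 - weight) * est for est in raw]
    return sum((e - t) ** 2 for e, t in zip(shrunk, true_rates))

# shrinking partway toward the pooled mean typically lowers total squared error here
print("no pooling     :", round(total_squared_error(0.0), 5))
print("shrink halfway :", round(total_squared_error(0.5), 5))
```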
Sound good? It is. But we can do better! When estimating multiple parameters (e.g., risks in three or more groups), shrinkage estimators like Stein estimators and empirical Bayes (EB) perform even better, sometimes much better, by estimating the amount of shrinkage from the data.
“An amazing property of the Stein estimators is that their total expected squared error is guaranteed to be less than that of the conventional estimators, regardless of the values of the target parameters or the prior means” (p. 161).
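To see that property in action, here is a hedged sketch of the classic positive-part James-Stein estimator, shrinking toward zero rather than toward prior means; the true means, noise variance, and simulation size are invented for illustration:

```python
# Simulated comparison of raw estimates vs. (positive-part) James-Stein shrinkage.
# True means, variance, and simulation count are illustrative assumptions.

import random
random.seed(2)

true_means = [0.5, 1.0, 1.5, 2.0, 2.5]   # five parameters (Stein needs at least three)
sigma2 = 1.0                             # known noise variance
n_sims = 20000
err_raw = err_js = 0.0

for _ in range(n_sims):
    x = [random.gauss(mu, sigma2 ** 0.5) for mu in true_means]
    ss = sum(v * v for v in x)
    shrink = max(0.0, 1 - (len(x) - 2) * sigma2 / ss)   # positive-part James-Stein factor
    js = [shrink * v for v in x]                        # shrinking toward zero here
    err_raw += sum((v - mu) ** 2 for v, mu in zip(x, true_means))
    err_js += sum((v - mu) ** 2 for v, mu in zip(js, true_means))

print("raw estimates avg total squared error:", round(err_raw / n_sims, 3))
print("James-Stein   avg total squared error:", round(err_js / n_sims, 3))
```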
EB relaxes the need to specify prior means. It lets the data choose the prior, assuming exchangeability (no reason to privilege one group a priori). This yields robustness to bad priors.
Both Stein and EB estimators adjust how much to shrink based on how noisy the estimates are. Again, this is ideal for repeated screening contexts where there’s a signal, we’re just struggling against the noise.
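As one rough illustration of that data-driven shrinkage, here is a method-of-moments EB sketch on hypothetical screening-site counts I made up; this is one simple way to estimate the between-site spread, not necessarily the procedure Greenland has in mind:

```python
# Rough empirical-Bayes sketch: estimate between-site variation from the data,
# then shrink noisier sites harder toward the pooled rate.
# Site counts and the method-of-moments step are illustrative assumptions.

# hypothetical flag rates at 6 screening sites: (positives, tests)
sites = [(8, 500), (12, 2000), (3, 150), (40, 8000), (15, 900), (0, 300)]

rates = [p / n for p, n in sites]
grand = sum(p for p, _ in sites) / sum(n for _, n in sites)    # pooled rate

within = [grand * (1 - grand) / n for _, n in sites]           # sampling noise per site
between = max(1e-9,
              sum((r - grand) ** 2 for r in rates) / len(sites)
              - sum(within) / len(sites))                      # crude between-site variance

for (p, n), r, noise in zip(sites, rates, within):
    weight = between / (between + noise)   # noisier site -> more shrinkage to the pooled rate
    eb = weight * r + (1 - weight) * grand
    print(f"site {p:>3}/{n:<5} raw={r:.4f}  EB-shrunk={eb:.4f}")
```

The small sites get pulled hardest toward the pooled rate, which is exactly the "adjust how much to shrink based on how noisy the estimates are" behavior.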
More Bayes more better? There’s also Bayes empirical Bayes, which improves estimates by (what else) using prior information, and so can yield ESE even lower than that of EB estimators (p. 164).
What Modeling Approach Makes Most Sense for Iterated Screening?
Something is still bothering me about applying Bayes in iterated screening contexts to make frequency-format tables so people can actually grok risks per Gigerenzer & Hoffrage 1995. It’s the disease marker versus planetary composition problem.
In some screenings, the right way to think of the test is like a blood test. The disease marker is in the blood, or it’s not. You don’t get more accurate results if you take it again.
In the mammography context, the boob either has a tumor, or it doesn’t. In the Chat Control context, the messages either contain child porn, or they don’t. There can be classification mistakes either way, but the underlying state is a static, binary thing.
But what if, in other iterated screenings, the right way to think of it is more like, “Is the Earth mostly water or mostly ground?” What if taking more samples makes the answer more accurate? Like we would expect the accuracy of a mass digital communications screening to increase with time if it’s trying to catch a pattern instead of just a one-off offense in the first place.
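Here is a toy contrast of the two views, with invented accuracy numbers. In the disease-marker view, repeated looks only help if the test errors are independent across rounds (a big assumption, and exactly what rerunning the same blood test doesn't give you); in the planetary-composition view, more samples mechanically tighten the estimate:

```python
# Toy contrast: updating belief about a fixed binary state vs. estimating a proportion.
# Sensitivity, specificity, prevalence, and the "true fraction" are made up.

import random
random.seed(3)

# --- disease-marker view: Bayes update on one binary state, per (assumed independent) test ---
def bayes_update(prior, positive, sens=0.95, spec=0.95):
    like_t = sens if positive else 1 - sens
    like_f = (1 - spec) if positive else spec
    return prior * like_t / (prior * like_t + (1 - prior) * like_f)

belief = 0.001
for outcome in [True, True, True]:            # three hypothetical positives in a row
    belief = bayes_update(belief, outcome)
print("binary-state belief after 3 positives:", round(belief, 3))

# --- planetary-composition view: estimating a proportion from more and more samples ---
true_fraction = 0.71                          # say, fraction of surface that is water
for n in (10, 100, 10000):
    hits = sum(random.random() < true_fraction for _ in range(n))
    print(f"{n:>6} samples -> estimate {hits / n:.3f}")
```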
Maybe that’s not sociopolitically realistic, in the sense that mass surveillance proponents want to catch all the baddies, all the time. But that goal isn’t mathematically realistic, because mass screenings for low-prevalence problems are trapped in the accuracy-error trade-off. Accepting more false positives means fewer false negatives, but overwhelms secondary screening capacities; accepting fewer false positives means more false negatives, and either way both types of error remain. This is the structure of the world.
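To put rough numbers on that trade-off: with an invented population, prevalence, and three hypothetical operating points, tightening or loosening the threshold just moves the pain around, and the false positives always scale with the enormous negative class.

```python
# Sketch of the accuracy-error trade-off at low prevalence.
# Population, prevalence, and the sensitivity/specificity pairs are invented.

population = 10_000_000
prevalence = 0.0001            # 1 in 10,000
cases = int(population * prevalence)
non_cases = population - cases

# (label, sensitivity, specificity) at three hypothetical thresholds
for label, sens, spec in [("strict", 0.70, 0.999),
                          ("middle", 0.90, 0.99),
                          ("loose",  0.99, 0.95)]:
    false_neg = round(cases * (1 - sens))
    false_pos = round(non_cases * (1 - spec))
    print(f"{label}: misses {false_neg} real cases, flags {false_pos} innocent ones")
```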
So maybe we should want a world of mass screenings for low-prevalence problems that update on past results where iterative screening can catch problem trends or patterns better this way. Maybe we should want to see applications of Bayes’ Rule that show what that updating would look like in frequency formats. And maybe there is a simple way to do that that doesn’t involve multilevel modeling, but I haven’t figured out yet what it is.
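For what it's worth, one hedged sketch of what such a frequency-format update could look like, under the loud assumption that errors are independent across rounds and with every number invented, is to just push natural frequencies through two screening stages and watch the composition of the flagged group change:

```python
# Natural frequencies through two rounds of screening, assuming independent errors
# across rounds. Cohort size, prevalence, sensitivity, and specificity are invented.

cohort = 100_000
prevalence, sens, spec = 0.001, 0.95, 0.95

have, dont = cohort * prevalence, cohort * (1 - prevalence)
for round_no in (1, 2):
    true_pos = have * sens
    false_pos = dont * (1 - spec)
    flagged = true_pos + false_pos
    print(f"round {round_no}: {flagged:,.0f} flagged, of whom {true_pos:,.0f} truly affected "
          f"({true_pos / flagged:.1%})")
    have, dont = true_pos, false_pos   # only the flagged go on to the next round
```

Whether that independence assumption is ever defensible for the same scanner run twice is, of course, the whole question.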
Of course, all models are wrong. More specifically:
every inferential statistic (such as a P-value or confidence interval) is model based… This limitation… is a severe limitation of every analysis of epidemiological studies of cause and effect because in such studies we almost never know enough about the processes that determine exposure or disease to be even moderately certain that our model is even approximately valid (Greenland, p. 164).
This levels up in truthiness when we extend from epidemiological contexts like breast cancer to sociopolitical ones like texting forbidden stuff (the definition of which is going to vary considerably by regime, states being what they are).
ISO Other Bayesian Searches
So “Bayesian search” is not one thing. Some of its incarnations rely on unrealistic assumptions or nonsensical standards of proof. And if what you want is just to apply Bayes’ Rule in a wiser way than a one-off to learn from past results in an iterative screening context, multilevel modeling offers several options that probably make more sense.
Sorry, I have probably left out a lot of other associations and uses of the term. I did not even try to do a comprehensive treatment of this, and would welcome suggestions for further reading. Also simply for better ways to estimate outcomes in iterated screenings.
The best way forward I see so far is multilevel modeling. It offers a few options for integrating prior information, including past test results, into future screening expectations. Binary tests yield nicely structured information that we should make use of in estimating outcomes. This seems like an honest way to do that.
But I still need to think again about whether we are really interested in disease detection or planetary composition-type problems when we talk about mass screenings for low-prevalence problems. I think this is a question of how we want to conceive signal detection problems, and the answer might be strategic and context-dependent rather than having a one-to-one correspondence with empirical reality. Do we want to find something as close to the Holy Grail as possible (struggling for the signal amid the noise)? Or do we want to pick up on characteristics of the sound that can help us sketch the signal? When, which, why?