Mass Screenings for Low-Prevalence Problems in the News
Chat Control, Gospel + Lavender, TRANSFORM: case studies of a dangerous structure
EU digital surveillance program Chat Control is still coming for your dick pics. The IDF allegedly used AI to identify up to 37,000 airstrike targets — contributing to roughly the same number of civilian wartime casualties. And, in the UK, the NHS is backing research that will send roving MRIs in vans (“scan-in-a-van”) to target black UK men over 45 for prostate cancer screening — though it’s not clear the benefits of available diagnostics and treatments outweigh their harms.
These disparate programs all employ the same potentially dangerous mathematical structure: They’re mass screenings for low-prevalence problems. Such programs are increasingly common across diverse realms, from security to education to health. Technological advances make them easier to implement, and techno-solutionism makes them seem like win-wins — improving accuracy and efficiency in pursuing agreed public ends like security and health. Who doesn’t want to get pedophiles and terrorists? Who’s on Team Cancer?
Yet these programs share a structure that often dooms them to backfire according to the implications of probability theory (see, e.g., this previous analysis and related talk/slides). The problem is that, under conditions of rarity, persistent inferential uncertainty (not being able to assess for sure which classification is right), and secondary screening harms, they can do more harm than good.
This net harm risk is especially salient when claimed accuracy rates are inflated by perverse incentives, real-world error rates are worse than reported because bad actors game the system (as is common in security contexts), and reverse causality is in play (i.e., the screening/intervention can accidentally harm the very outcome it’s meant to benefit, such as security or health). But base rate bias — a common cognitive bias triggered by rarity — tends to render this net harm risk invisible. Left unchecked when accepted scientific evidentiary standards are not applied, that invisibility is what makes this a dangerous structure.
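To see why rarity matters so much, here is a minimal Bayes’-rule sketch in Python, with purely illustrative numbers that are not estimates for any real program: even an impressive-sounding test produces mostly false flags when the condition it screens for is rare.

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value: P(condition | flagged), via Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Illustrative numbers only, not estimates for any real program:
# a tool with 99% sensitivity and 99% specificity, screening for a
# condition affecting 1 in 1,000 people.
print(f"{ppv(0.001, 0.99, 0.99):.2f}")  # 0.09 -- roughly 9 in 10 flags are false
```

The intuition: when 999 of every 1,000 people screened don’t have the condition, even a 1% false positive rate generates about ten false flags for every true one.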
Applying accepted scientific evidentiary standards would require transparent, independent evaluation of the evidence for claimed costs and benefits, to establish that new interventions do more good than harm. It would also highlight that many programs with this structure are doomed, because we often cannot solve the scientific problem of test validation under real-world conditions. We often also struggle to measure secondary screening/intervention benefits and harms.
Still, not all programs with this structure are doomed to backfire. Some are wildly successful, benefiting the public interest tremendously. For instance, universal HIV screening of pregnant women, even in places where prevalence is quite low, prevents infants from getting HIV. The screening works because, despite the problem’s rarity, further testing can disambiguate true from false positives with minimal secondary screening costs/harms. England likewise recently eliminated vertical transmission of hepatitis B with a successful mass screening program.
Other recent instances of this type of program in the news are less encouraging. This post briefly looks at three case studies of this dangerous structure…
***
Chat Control redux: Zombie mass surveillance proposal won’t die
This coming Monday, Sept. 23, the EU Council will meet again on “Chat Control,” a proposal to use AI to scan digital communications for child sexual abuse material and grooming (leaked Presidency proposal text here). Previously, as Pirate Party Germany MEP Patrick Breyer wrote:
In June 2024 an extremely narrow “blocking minority” of EU governments prevented the EU Council from endorsing chat control. Chat control proponents achieved 63.7% of the 65% of votes threshold required in the Council of the EU for a qualified majority.
On Sept. 4, the governing Hungarian EU Council Presidency then introduced the new proposal seeking a compromise with the European Parliament’s considerably more privacy-friendly version. Its concessions include limiting detection to known abuse material, and only flagging new material and grooming for risk assessment and mitigation purposes. Italy, previously part of the blocking minority, subsequently switched sides.
The proposal’s long-term aim remains to introduce more surveillance measures when better tech eventually exists to address the program’s fatal flaw: Due to the rarity of the problem, false flags will vastly outnumber correctly classified abuse cases. Trying to disambiguate true from false positives will unfairly target huge numbers of innocent people, including minors, who will likely suffer secondary screening harms in the process. These efforts are also likely to consume finite investigative resources needed for targeted investigations (see previous analysis — 1, 2, 3, 4). Proponents’ unchanged big-picture vision of the program is thus still doomed to backfire, undermining online child safety more than it benefits it — while undermining privacy for all.
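To get a feel for the absolute numbers at stake, here is a rough back-of-the-envelope sketch. Every input is a hypothetical assumption for illustration; actual message volumes, prevalence, and error rates for any real scanning system are unknown and unvalidated.

```python
# All numbers are hypothetical assumptions for illustration only.
messages_per_day = 10_000_000_000  # rough order of magnitude for a major platform
abusive_share = 0.00001            # assume 1 in 100,000 messages is abusive
sensitivity = 0.9                  # assume the scanner catches 90% of abuse
false_positive_rate = 0.001        # assume 0.1% of innocent messages get flagged

true_flags = messages_per_day * abusive_share * sensitivity
false_flags = messages_per_day * (1 - abusive_share) * false_positive_rate

print(f"{true_flags:,.0f} true flags/day")    # 90,000
print(f"{false_flags:,.0f} false flags/day")  # ~10,000,000
print(f"{false_flags / (true_flags + false_flags):.1%} of all flags are false")
```

Even under these forgiving assumptions, reviewers would face millions of flagged messages from innocent people every day: the resource and harm problem the previous analyses describe.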
As Maïthé Chini reports for The Brussels Times:
In practice, this means that the negotiations currently taking place concern a toned-down version of 'chat control', but with the aim of introducing the measures in the long term when the technology is more up-to-date.
This misguided techno-solutionism fails to recognize that the dangers of this program stem from the implications of universal mathematical laws that technology cannot escape. And that those implications spell massive possible damages precisely for the ends (security) and people (minors) the proposal is intended to benefit.
It also ignores two dimensions of corruption that color the issue. The first is Chat Control’s corruption scandal wherein tech companies that stood to gain huge profits were deeply involved in writing the proposal — as well as in orchestrating the appearance of public support for it, including by using child abuse survivors for this purpose. The second is the history of surveillance apparatus intended for one use being extended to serve other (political) purposes. As critics like Breyer observe, the blanket surveillance required to scan digital communications for any one thing creates the infrastructure — including undermining encryption — to scan for others.
Current and historical concerns about corruption in mass surveillance, as well as the zombie nature of the proposal despite its fatal flaw, raise the question: Is it naive to think the statistics part of the story can change the world?
The optimistic story one could tell here is that proponents of mass surveillance à la Chat Control risk massive damages by way of implementing a dangerous structure (mass screening for low-prevalence problems) — without understanding the implications of probability theory for the properties of the screening they want to administer. If only the good king understood what his people were going through in reality, instead of being misinformed by bad advisers, then he would fix things — the loyalist mythology goes. A chicken in every pot and end-to-end encryption on every phone.
Alternatively, some critics question whether the goal of mass security screenings for low-prevalence problems is really advancing security, as opposed to:
increasing state power (the more discretionary, the better),
contributing to instability including war that may keep incumbents in office,
appearing to make fallible, discretionary decisions on neutral, objective grounds, potentially supporting soft power,
making money for corrupt officials and their friends, and/or
other extraneous purposes.
These narratives aren’t necessarily mutually exclusive. State and other institutional decision-making bodies tend to include multiplicities of interests and psychosocial perspectives. Maybe it is by emphasizing science that we can best counter bias and corruption alike. It worked for Stephen Fienberg with DOE polygraph programs; maybe it could work again.
***
The Gospel of Boom: Some IDF target selection decision support systems implemented the dangerous structure
Earlier this month, Israel signed the first international treaty on AI, which excludes national defense and security — typical exemptions in treaties and EU/domestic legal regimes alike. In spite of these exclusions, Middle East Institute Migration and Technology Fellow Mona Shtaya wrote this was “deeply troubling” because of Israel’s use of AI in its current war. In effect, Israel stands accused of running mass AI screenings for the low-prevalence problem of terrorism — and blowing up a large number of its false positives.
According to April 2024 media reports (+972 Magazine, The Guardian), some in the Israeli military let AI-powered decision support systems select thousands of targets — places with the Gospel, people with Lavender — to bomb with minimal to no effective human oversight, especially in the early days of Operation Iron Swords, its ongoing response to Hamas’s October 7 attacks near the Gaza Strip. Israeli international law experts responded that “the process mentioned in the 972+ article is a very preliminary one in the chain of creating, and authorizing, a military target,” which is itself only one decision in a lengthier decision-making process going up the chain of command from intelligence officers to other experts and more senior officers. Similarly, the IDF responded that these systems are decision support tools used in accordance with international law to help analysts selecting targets improve their accuracy and efficiency by making evidence-based suggestions. West Point Distinguished Scholar Michael Schmitt also emphasized that Gospel and Lavender are “decision-support systems that can significantly enhance LOAC [Law of Armed Conflict] compliance.”
Nonetheless, earlier this month, longstanding mass Israeli anti-war protests grew into the hundreds of thousands. Protesters have continually called for a deal to return the remaining hostages, with some critics charging that Israeli PM Bibi Netanyahu is pursuing his own self-interest at the hostages’ expense by waging this war in this way. If the law of targeting requires, as Schmitt says, ensuring targets’ direct connection with the military opponent, taking feasible precautions to minimize harm to civilians, and proportionality, then one way of reformulating widespread internal and international criticism of Operation Iron Swords is that it violates this law. (Granted, like all wars, this one involves heterogeneous targeting, some of it — like this week’s coordinated pager and walkie-talkie explosions — apparently extremely direct, harm-minimizing, and proportional.)
At a higher level, this discourse highlights a few typical claims and counterclaims about what AI does (and doesn’t do) for experts. Critics often claim the tech makes the decisions because it makes classification calls that drive next steps. Meanwhile, proponents often claim this is false, because there’s always a “human in the loop.” There are competing claims about how the military (or border patrol, medical systems, academic publishers, etc.) really uses these systems on the ground. (Sometimes, the same individuals even make these competing claims.)
Outside observers can’t know the truth-value of these competing claims when we can’t assess, on the basis of primary-source evidence, how autonomous versus complementary the systems’ use really is — either systematically or individually. But we can observe that institutional norms often tend to create pressure for agreement with the AI’s decision, at least at the level of further investigating flagged cases. And that this tendency toward agreement and secondary screening may itself blur the line between AI suggestion and expert decision.
We know that confirmation bias (the tendency to read new information as confirming your preconceptions) is pervasive and powerful. Some of my dissertation research suggested that it affects technology-mediated security decisions. It would be surprising if it didn’t. One implication is that there is no neat dichotomy between the tech making the decision qua classification and having a human in the loop.
Similarly, critics claim the bulk nature of such screenings, combined with the rarity of the problem, means that false positives likely vastly outnumber true positives — while non-negligible numbers of false negatives also persist, per the inescapable accuracy-error trade-off. This might cause these sorts of systems to net degrade security: for instance, by generating more terrorism through the killing of large numbers of innocent civilians, while also missing key terrorist targets.
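A small simulation makes the critics’ trade-off point concrete. This is a sketch under assumed (Gaussian, overlapping) score distributions, not a model of any real system: however you set the decision threshold, you trade one error type for the other.

```python
import random

random.seed(0)

# Assumed, overlapping score distributions: the classifier scores true
# cases higher on average, but no threshold separates them cleanly.
# Rarity is built in: 1 true case per 1,000 screened.
negatives = [random.gauss(0.0, 1.0) for _ in range(999_000)]
positives = [random.gauss(2.0, 1.0) for _ in range(1_000)]

for threshold in (1.0, 2.0, 3.0):
    false_pos = sum(s >= threshold for s in negatives)
    false_neg = sum(s < threshold for s in positives)
    print(f"threshold {threshold}: ~{false_pos:,} false positives, "
          f"~{false_neg:,} false negatives (of 1,000 true cases)")
```

Lowering the threshold to catch more true cases buries analysts in false positives; raising it to cut the false positives misses most of the true cases. Tuning doesn’t escape the trade-off; it only relocates the damage.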
Meanwhile, proponents argue accuracy and efficiency gains result precisely from the blanket nature of the screenings. Again, there are competing claims about the systems’ net effects in this regard. And again, outside observers can’t know the truth-value of these competing claims. (Insiders can’t either, if it’s not possible to validate the tests through resolving inferential uncertainty, or measure secondary screening/intervention harms.)
We don’t have to know. The core criticism is not “we know that these programs backfire.” Rather, it is “we don’t know whether they cause net benefit or harm.”
On one hand, the accuracy-error trade-off stems from universal laws of mathematics that no technology or person can escape. It means we are stuck with these sorts of programs making either too many false positive or too many false negative classifications, because decreasing one type of error increases the other. Both types of error have repercussions, in this case for Israel’s security.
On the other hand, experts with inside knowledge often seem to claim precisely that their implementation of these programs or their actual use of these sorts of technologies does escape this trade-off. They claim to do this by using these tools on the ground as complementary resources, speeding identification of potential cases to investigate (some true, most* false flags). They claim to do this while also investigating cases not flagged by these tools (false negatives) based on other information. (*We don’t know what the numerator, denominator, base rate of terrorism, or other specifications are in these particular systems. But, assuming terrorism is a low-prevalence problem in this context, like other mass screenings for low-prevalence problems, we would expect them to generate overwhelmingly false flags even with an accuracy rate of 90%.)
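Putting rough numbers on that starred caveat, under assumptions of my own (90% sensitivity and 90% specificity, 1 true target per 1,000 people screened; the real specifications are not public):

```python
# Assumed specifications for illustration; the real ones are not public.
prevalence, sensitivity, specificity = 0.001, 0.90, 0.90

true_flags = prevalence * sensitivity               # 0.0009
false_flags = (1 - prevalence) * (1 - specificity)  # 0.0999
print(f"{true_flags / (true_flags + false_flags):.1%} of flags are true")  # 0.9%
```

That is, even at 90% accuracy on both dimensions, more than 99 of every 100 flags would be false under these assumptions.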
Critics counter-claim that investigative resources are finite. Analysts charged with sorting vast numbers of false positives cannot use the same time and energy to find the false negatives, compromising potential net security gains from accuracy. And blowing up more false positives faster doesn’t accrue net security gains from efficiency, if it generates more terrorists.
In the bigger picture, this type of argument seems to be a disagreement between scientists and institutionalists. It’s ironic, because triangulation of observation, analysis, and interpretation of different sources and methods is perhaps the most respected research design strategy of modern applied and social science.
But scientists are after triangulation of combinations of valid, if imperfect, methods. Triangulating unvalidated tools — or, more often in security and other contexts, tools that can’t be validated in the real world — with valid ones, does not strengthen the evidentiary basis of decision-making.
***
TRANSFORM: UK prostate cancer screening trial implicates the same structure
On Sunday, The Guardian science editor Robin McKie uncritically reported on the TRANSFORM trial (“New screening trial could save thousands from prostate cancer,” Sept. 15, 2024). The article reads like a press release for the trial team. Beyond the optimistic title, which echoes the May write-up by Imperial College London (an institution involved in the trial) based on materials provided by trial sponsor Prostate Cancer UK, there’s a lot of unreflective hype in this piece.
The history of mass PSA (prostate-specific antigen) testing for cancer should give us pause. The Harding Center’s synthesis suggests the best available evidence cannot establish a net mortality benefit from this previous gold-standard screening. Meanwhile, it suggests a false positive incidence of 155/1,000 and a non-progressive prostate cancer diagnosis/treatment rate of 51/1,000. That is, for every 1,000 men screened, 155 were wrongly scared by a misleading result — and usually had tissue removed unnecessarily to disambiguate true from false positives. And even among the true positives, 51 of those originally 1,000 screened men turned out to have non-progressive prostate cancer — so they didn’t benefit from diagnosis/treatment either.
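Tabulating those figures per 1,000 men screened (the 155 and 51 come from the Harding Center fact box; the addition is mine):

```python
# Figures per 1,000 men screened, from the Harding Center fact box.
screened = 1_000
false_positives = 155  # wrongly alarmed; most biopsied unnecessarily
overdiagnosed = 51     # diagnosed/treated for non-progressive cancer

no_benefit = false_positives + overdiagnosed
print(f"{no_benefit} of every {screened} men screened "
      f"({no_benefit / screened:.0%}) got a scare, biopsy, or treatment "
      f"that could not have helped them")
```

Roughly one in five screened men was flagged or treated without the possibility of benefit, before counting any harms from the interventions themselves.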
The unnecessary prostate tissue removal that most of those falsely flagged 155/1,000 underwent risks incontinence and impotence — details the news article and the trial website itself fail to mention. Such interventions also carry rare but serious risks of infection. As the Harding Center fact box says, “biopsies to clarify suspicious test results are associated with a risk of hospitalization and death. Unnecessary diagnoses and treatments increase the risk of heart attacks, suicide and death from treatment complications.”
It is unfortunate that the news article does not link to this fact box or quote anyone from the UK National Screening Committee, which it notes “has refused to give the go-ahead for a national prostate programme on the grounds that it would do more harm than good.” Instead, it follows that acknowledgment with this statement:
“That has to change,” said Prof Hashim Ahmed of Imperial College London. “We have to be in the position that we are with breast and cervical cancer when a woman is invited by her GP to have a mammogram or a cervical smear. By contrast, prostate cancer testing is irregular, patchy and unsystematic.”
The article does not mention that Prof. Ahmed is one of the six lead researchers on TRANSFORM. Nor does Ahmed acknowledge the widespread condemnation of mammogram breast cancer screening by leading evidence-based medicine reformers including Gøtzsche and Bewley, or the literature on harm from cervical cancer screening overuse. Contrary to what Prof. Ahmed implies, experts do not agree that we want routine prostate, breast, or cervical cancer screening.
TRANSFORM plans to recruit hundreds of thousands of men to pursue its £42 million research program. How do researchers plan to inform these subjects about the possible risks and benefits of participation — especially given that they are unknown for new combinations of diagnostic methods, and may reproduce the arguable net harms of the last diagnostic generation’s PSA testing? How does this research differ from the NHS’s much-criticized 2022 prostate cancer search scheme? And why does the research team appear to believe it’s a good thing to leave key variables as yet unspecified?
In the trial website’s promotional video (around minute 1:30), Laura Kerby, Chief Executive of Prostate Cancer UK, says, “Crucially, the trial has been designed flexibly and will be able to incorporate promising new testing methods at any stage of the process.” There does not appear to be a related ClinicalTrials.gov entry. Preregistering clinical trial designs has been one of the most important advances in modern open science. It serves to prevent misreporting of exploratory research (such as this) as confirmatory, to reduce publication bias and false-positive results inflation, and to prevent outcome switching. It’s a transparency gain that benefits science in the public interest. But not all of medicine has been so transformed.
TRANSFORM is exploratory research supporting a big-picture vision of mass screening for a low-prevalence problem. It risks causing net harm to its research subjects. This is why leading oncologist Paul Cornes, speaking to the BBC in August, called for more information about TRANSFORM and said he himself would “think twice” before participating in the trial. All of which raises additional concerns about the ethics of targeting black men in particular for participation. As most researchers conducting human subjects research are aware, there is a long history of unethical medical research on minority racial/ethnic groups, from Mengele to Tuskegee. But it can still be hard to get meaningful media or institutional attention for any one possible echo of such egregious historical injustices.