From Math to Matter
Making real-world sense of the theoretical implications of probability theory
Apparently, I once played baseball and ate cake with my beloved older brother Joe. This is of course a very American thing to have done. I’m American, it’s believable, and there’s a photo; so I believe it. But I wouldn’t know if there weren’t a picture in Uncle Bobby’s collection that worked its way back to me. I can’t remember ever playing baseball in my life; but I can’t remember a lot of things. So pictures — small, artificial snippets from a particular perspective — can come to represent objective records of whole things (like my childhood) that they don’t map onto very well at all. Much as I still love Joe and cake.
There’s a similar sort of mapping error going on when people leap from estimated hypothetical outcomes of mass security screenings for low-prevalence problems (a limited representation), to larger things that they might be said to theoretically represent — like values in one breath, and practical results in another. Previous posts argued that this type of program dangerously undermines the very security it’s intended to promote due to the base rate fallacy and inferential uncertainty, so the EU Council should keep the Parliament’s proposed bans on particularly dangerous AI, instead of carving out the usual security exceptions. This post continues that larger argument, focusing on this mapping error.
The error stems from another cognitive bias — alongside base rate blindness and dichotomania — that contributes to widespread misunderstanding of this dangerous type of program. The bias is statistical reification, defined by leading statistics reformer and UCLA epidemiology/statistics professor emeritus Sander Greenland as “treating hypothetical data distributions and statistical models as if they reflect known physical laws rather than speculative assumptions for thought experiments.” In other words, it’s one thing to look back at family photos and imagine how they fit into what else you know; it’s something else entirely to treat them as the backbone of the whole story, or as a heuristic you can rely on to make an important decision.
Far from being limited to discussions about the EU’s AI Act or the type of program at issue here, this sort of mapping error is common in science and science communication across fields. Significance testing misuse is the best-known example of what Greenland calls “The yawning gap between theories and realities of modern statistics.” That’s why Greenland and 800+ other scientists recently called for an end to the widespread misuse of statistical significance test thresholding to (mis)represent practical significance. (And yes, it’s statistics’ fault for making “statistical significance” not mean the same thing as “practical significance,” at all.)
This post deals with the treachery of language and other representational leaps. It argues that most people wrongly map values (liberty and security) and real-world consequences (costs and benefits) onto the hypothetical estimated outcomes of mass security screenings for low-prevalence problems. They do this implicitly, without following any chain of evidence or logic from program outcomes (hypothetical or actual) to those values. This mistake is understandable but bad.
It’s bad because it implies that we can trade some liberty for some security, when that trade is not really on the table. Accepting that framing anyway is common, and it is costly: it tricks us first into arguing about whether we want to take a trade that isn’t available, and second into thinking (if and when we do) that giving up some liberty has actually advanced security, when we’ve done no such thing. In fact, we’ve done something worse than the opposite: we’ve given up some security and some liberty in exchange for huge opportunity costs.
Watch the ball.
Trading Some Liberty for Some Security (Isn’t On the Table)
The promise of the AI Act is that states oriented to upholding human rights while promoting market and security interests can strike a balance between these ostensibly conflicting values by banning tech that’s too vulnerable to abuse, regulating tech that’s dangerous but useful, and dividing people into classes of more and less policed (e.g., migrants and EU citizens) to decide when high-risk tech may be used. This rings a bell. We often hear about trading some liberty for some security. What’s implied but usually unstated in these discussions is that the costs and benefits being traded are net costs and benefits.
The problem is not that we shouldn’t (necessarily) make such trades. Rather, it’s that they’re not on the table with mass security screenings for low-prevalence problems under conditions of inferential uncertainty. Pretending that they are masks our persistent ignorance under these conditions. We would like to stick that ignorance behind a shiny tech curtain and act like that makes it go away. It does not.
As prominent public statistician and NAS polygraph report co-chair Stephen Fienberg said of next-generation “lie detection” programs like FAST:
… well, at the airport, we want to get the man or woman who wants to get on the airplane and blow it up. I’m all for that. And if I knew a good way to do it, I’d be willing to surrender some liberties if there was really a device that I had some confidence in that worked. I’m willing to allow false positives if I’m really catching people; I haven’t seen the evidence.
There’s a major difference between asking people about something that you can verify and asking them about something that you can’t. I was sitting at the table last night with a group of colleagues from another committee, and they said, “So how would you validate this?” and I said “I don’t know.”
That’s part of the problem. I don’t know what’s the experiment I could conduct that’s double-blind that would give me enough data to conclude that this works or it doesn’t work, and I don’t know how to plan for it, and I’m not hearing any of these people saying that they’re worrying about that. They’re using it in some trial program, but I can’t imagine they’re using it in the context where I want to see it evaluated, and I find that pretty frightening. - 2009 interview
Trading some liberty for some security isn’t a real option under conditions of inferential uncertainty. That’s why mass security screenings for low-prevalence problems ironically harm the very security they’re intended to protect. I know it may still sound weird. I think this is largely a language problem. So I’m going to pick on a few examples from credible sources whose work I generally respect and appreciate, to show how the language goes wrong.
The Math Is the Issue
Most people (agencies, states, activists) accept the “liberty versus security” framing of these debates, but I argue that dichotomy is false and dangerous when it comes to mass security screenings.
Example 1: Kickass European digital rights umbrella org EDRi produced this Chat Control video that begins “In the European Union, there’s a law being negotiated that takes a dangerous turn in the trade-off between safety and freedom.” The video goes on to make excellent points about children and youth vulnerability to police investigation for consensual chatting, the need for more resources for child protection, and the wisdom of investing limited resources in relatively effective protection measures instead of developing infrastructure for a dystopian surveillance state.
But the intro here is misleading. Chat Control doesn’t actually offer a trade of some determinate amount (or even range) of net freedom in exchange for some determinate amount (or range) of net security. Framing the program in those terms suggests to people that they have that trade to make, when they don’t.
Instead, Chat Control’s proposed client-side scanning — in which providers would run an algorithm designed to detect kiddie porn as well as grooming-type abuse on digital communications, before they’re encrypted — would probably degrade both safety and privacy. Especially for kids and teens. This is a vulnerable group that deserves extra protection, not bad policy that makes them more vulnerable.
In the bigger picture, the typical security versus liberty “trade-off” frame in these discussions is inconsistent with the implications of probability theory in mass surveillance. In addition to being empirically wrong, assuming such a trade-off puts critics at a rhetorical disadvantage. It’s fine to argue as a matter of first principle that we shouldn’t take such a trade. Those are great debates. But they’re irrelevant.
The real choice is between hurting security and liberty with mass surveillance for low-prevalence problems, or not. So if you want to promote security, don’t use this type of program. If you want to promote liberty, don’t use this type of program.
Don’t talk about security versus liberty when the math shows there’s no such trade on the table.
Example 2: Signal president Meredith Whittaker recently told Cyberscoop:
I think you need to face it. It’s not like I can’t avoid it or say, “We’re not talking about that, we’re talking about math.” That’s not really addressing the issue.
But it very quickly becomes a frame where it’s almost like, all child abuse is caused by online [activity]. So, the frame of the problem is suddenly technological. So that frame of the solution is, of course, technological, right? And all of this ignores the fact that there are children suffering in the real world, and they need help.
The majority of abuse happens in families and when it doesn’t happen in families it is largely perpetrated by an adult who is an authority figure tasked with caring for a child in some form. That’s not happening online. That’s happening in the real world.
There are dynamics here that are really, really grim, that we do need to look in the face if we’re going to address this. And I think in some ways, abstracting this online and making it a problem of the technology and of the “tech boogeyman” is a way of actually avoiding looking at those dynamics in the face.
This statement seems to reflect a misunderstanding of the fact that there are two planes of issue space here — the specific (Whittaker’s) and the general (mine). The specific (call it a horizontal issue space) might be viewed as child abuse, child sexual abuse, or domestic and sexual abuse more broadly. The general (vertical, mathematical) pattern of this type of program (Chat Control) and other programs like it — with lots of issue space planes cutting through it — is mass security screenings for low-prevalence problems. Panning out to link the horizontal and vertical patterns, the issue is evil. We are animals with essentially unchanged 100,000-year-old hardware who are sometimes not very nice to each other and not very smart about trying to change that, even when bad behavior threatens societal interests.
In the policy space, whether or not a program degrades exactly what it seeks to protect is the issue. What we need to talk about is whether this program works. And according to the math, it backfires, just like all programs of its type. Mass security screenings for low-prevalence problems suffer from an irresolvable tension between accuracy and error in signal detection. They trade excessively large and damaging numbers of false positives for persistently problematic false negatives, in a world where dedicated attackers can inflate the false negatives, where the finite resources spent sorting through the false positives could have gone to other methods instead, and where the false positives themselves undermine the very security the programs are intended to protect.
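To make that tension concrete, here is a minimal signal detection sketch. It is my illustration, not a model of any particular program: it assumes a screening score whose distributions for real cases and for innocent people are unit-variance Gaussians separated by an assumed d′ of 1.7, with 10 real cases hidden among 10,000 people. Wherever you put the flagging threshold, you are choosing between missing real cases and flagging crowds of innocent people.

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

population = 10_000
targets = 10          # low-prevalence problem: 10 real cases in 10,000 people
d_prime = 1.7         # assumed separation between target and innocent score distributions

print("threshold | targets caught (of 10) | innocents flagged (of 9,990)")
for threshold in (0.0, 0.5, 1.0, 1.5):
    hit_rate = 1 - normal_cdf(threshold - d_prime)   # P(flagged | real case)
    false_alarm_rate = 1 - normal_cdf(threshold)     # P(flagged | innocent)
    caught = targets * hit_rate
    flagged_innocents = (population - targets) * false_alarm_rate
    print(f"{threshold:9.1f} | {caught:22.1f} | {flagged_innocents:28.0f}")
```

With these assumed numbers, a threshold loose enough to catch nearly all ten real cases flags roughly half the innocent population, while a threshold tight enough to flag “only” a few hundred innocents misses several of the ten. The exact figures depend entirely on the assumptions; the shape of the dilemma does not.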
So let’s stop having the wrong conversation about whether we want to trade some liberty for some security, when that trade is not on the table. Let’s not waste time virtue signaling or otherwise lamenting that bad behavior is bad. And let’s start digging into what liberty and security mean (and don’t mean) here.
The High Cost of False Positives Accrues to Liberty and Security Alike
… we were told more than once that it’s too bad that there are false positives, but we have enough applicants. We don’t have to worry about that. We just want to be sure that people pass the polygraph. And the response not for me alone, but from other members of the committee was: “But that’s false positives. What about false negatives? Do you understand the properties of the exam that you’re administering and what it’s doing?” And by the way, we don’t buy the other argument, either, but it clearly had convinced the people at the so-called three-letter agencies. - Stephen Fienberg, 2009 interview
Why do civil liberties advocates and security experts at governmental agencies alike commonly get it wrong when it comes to the implications of probability theory for mass security screenings for low-prevalence problems? They seem to share a misreading of the confusion matrix: the table that takes the same form and obeys the same probability theory across different cases of this type of program, from scanning all private digital communications for kiddie porn to polygraphing all employees with access to sensitive information for espionage. With apologies for some repetition from a previous post, that matrix has the same four cells every time: true positives (threats caught), false negatives (threats missed), false positives (innocents flagged), and true negatives (innocents cleared).
Filling it out with specifics from my favorite example again: the NAS polygraph report estimated outcomes in a hypothetical population of 10,000 National Lab scientists containing 10 spies (a higher-than-probable base rate), using a screening with 80% accuracy (a very generous assumption) and omitting cases where no determination could be reached (a non-negligible group). Under those assumptions, there was a 99.5% probability (1598/1606) that a person classified as deceptive was truthful, an almost 16% chance (1598/9990) that an innocent person failed, and a 20% chance that a spy passed. Nearly 200 innocent people failed for each spy caught (1598 false positives / 8 spies caught). The same math maps onto other mass surveillance programs.
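For anyone who wants to check the arithmetic, here is a minimal Python sketch that rebuilds those counts from the stated assumptions. It is not from the NAS report itself; the variable names are mine, and the roughly 16% false-positive rate is simply the one implied by the report’s 1,598 flagged innocents out of 9,990.

```python
# Rebuild the hypothetical screening table from the assumptions quoted above.
population = 10_000                  # hypothetical National Lab employees
spies = 10                           # assumed base rate: 10 spies per 10,000
sensitivity = 0.80                   # chance a spy fails the screening (the "80% accuracy" assumption)
false_positive_rate = 1598 / 9990    # ~16%, implied by the report's figures

innocents = population - spies
true_positives = round(spies * sensitivity)               # spies caught: 8
false_negatives = spies - true_positives                  # spies missed: 2
false_positives = round(innocents * false_positive_rate)  # innocents who fail: 1,598
true_negatives = innocents - false_positives              # innocents who pass: 8,392

flagged = true_positives + false_positives                # everyone who fails: 1,606
print(f"P(truthful | failed)   = {false_positives / flagged:.3f}")    # ~0.995
print(f"P(failed | innocent)   = {false_positives / innocents:.3f}")  # ~0.160
print(f"P(passed | spy)        = {false_negatives / spies:.3f}")      # 0.200
print(f"Innocents failed per spy caught: {false_positives / true_positives:.0f}")  # ~200
```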
This sort of table uses frequency formats to improve Bayesian reasoning, following Gigerenzer and Hoffrage. A lot of research seems to suggest that human beings have poor statistical reasoning. But putting outcome spreads in terms of whole-number counts, as this table does, instead of the conditional probabilities experts tend to use, helps us make much better inferences without statistical training. Of course we have to count the bodies; that’s what we evolved to do. That’s what’s so brilliant about this table.
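To show the two formats side by side, here is an illustrative sketch using the same assumed inputs as above (a 10-in-10,000 base rate, 80% sensitivity, and a roughly 16% false-positive rate). Bayes’ rule in probability format and the count format give the same answer; only one of them reads like counting bodies.

```python
# Probability format: Bayes' rule with conditional probabilities.
prevalence = 10 / 10_000
sensitivity = 0.80
false_positive_rate = 0.16

p_failed = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_spy_given_failed = sensitivity * prevalence / p_failed
print(f"Bayes' rule:  P(spy | failed) = {p_spy_given_failed:.4f}")   # ~0.0050

# Frequency format (Gigerenzer & Hoffrage): the same inference as whole-number counts.
# Of 10,000 employees, 10 are spies and 8 of them fail; of the 9,990 innocents, ~1,598 fail.
# So of the ~1,606 people who fail, only 8 are spies.
print(f"Count format: 8 of {8 + 1598} who failed are spies = {8 / (8 + 1598):.4f}")
```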
But what this sort of table doesn’t do is let the reader easily calculate net costs in terms of ultimate values to society like liberty and security. Because something happens to the 200 innocent people failing for each spy caught here — just like something happens to the innocent people (many of them children or teens) whose private communications would be incorrectly flagged by client-side scanning as child abuse under Chat Control, or the innocents wrongly flagged as terrorists by emotion-recognition AI screening travelers, or as fugitives by real-time biometrics and predictive policing in public spaces, or as scammers by the latest welfare benefits screening algorithm. That something might include traumatic investigations, job loss, trouble in school, child custody repercussions, and more. These are also outcomes with security implications in precisely the contexts we care about most, whether they be implications for nuclear scientists’ work environment, or for children’s well-being.
And, as Fienberg emphasized, the missed cases (false negatives) have important security implications, too. So for many reasons, we can’t just map “security” onto “true positives” and “liberty” onto “false positives,” or anything as simple as that, and call it a day. So how can we map these sorts of values we’re ultimately interested in, onto what we know about outcomes?
Watch the Ball
Remember the 2007-2008 U.S. financial crisis? It happened in large part because big financial institutions improperly valued mortgage-backed assets until the housing market collapsed. Financial institutions (big banks) had amalgamated mortgage-backed securities from lots of different sources, some of them traditional and safe, and some of them not. Perverse incentives, some of them well-intentioned but dumb (e.g., home ownership programs for people who couldn’t afford to own homes), played a role in generating the latter, less valuable assets. The problem arose when it turned out that the paper trails describing the contents of these amalgamations had not been properly made and kept. Sometimes, where there should have been a paper trail, there was only an electronic soup. So many people had no way of knowing what the securities really consisted of, and overvalued them.
The moral of the story is, just like your mom told you: always keep the receipts. Or, if you prefer, watch the ball. And remember that people in complex modern societies sometimes create big problems by failing to do this.
Similarly, we should watch the ball when we talk about liberty and security (or safety and freedom, or human rights and the war on terror; pick your terminology). The way to do that better is usually to make the reasoning visual (my crude way) and/or mathematical (many a better methodologist’s formalizing instinct). Let’s be crude for now. But first let’s look again at the messy way current discourse does this mapping, and why it’s wrong.
The way most people talk about liberty and security in mass surveillance discussions now is like the banks’ pre-crisis, forgetful amalgamation of different parts, missing the paper trail you would need to check the math. It implicitly prioritizes security and presumes these programs promote that value on net compared to available alternatives, when the empirical evidence doesn’t establish that. Because security in the National Labs, for example, comes in large part from the scientists themselves, subjecting them to potentially abusive interrogations that generate excessively high numbers of false positives doesn’t advance security. It degrades it. Fienberg and the rest of the NAS polygraph report committee didn’t believe the so-called three-letter agencies’ claim that they could afford to throw away all the human resources in the “deception indicated non-spy” cell in order to reap the benefits of throwing away some spies, in part because the Labs are those people, too. People are not fungible. The supply of possible nuclear scientists is finite. Making screenings tougher to remove more spies would, at its most extreme, mean no nuclear scientists. This would be bad.
The same logic works on the individual level. If you want no risk, you have to die to get it. Then you’re dead, and you might as well have taken some existential risks to live. So organisms take risks all the time, because it makes evolutionary sense (sometimes). Risk-aversion gets risky, fast.
Similarly, in the Chat Control context, security for children and youth comes in part from the privacy of their communications, particularly when they may be exchanging personal photos or texts. So breaking their encryption “to keep them safe” and subjecting them to more invasive scrutiny doesn’t necessarily contribute to children’s net safety in terms of keeping adults from ogling, badgering, or otherwise hurting them. It may endanger them instead. It’s uncertain how many children a program like Chat Control would save from previously unknown but definite abuse, how many children it would endanger instead, and how we should weigh those unknown harms to members of the same group we want to especially protect. It’s wrong to call trading some children’s security for others’, in this sense, a “liberty for security” bargain. That’s not what the children harmed by such a program would call it.
So how do we do this better? How do we watch the ball? Where’s the chain of logic or evidence flowing to and from when it comes to liberty and security in these programs?
Watching the Ball with M&Ms and Wile E. Coyote
Worrying about these sorts of things, I imagine the frequency counts of outcomes in confusion matrices like the NAS polygraph report table above as black-and-white wrappers (theoretical mathematical estimates of hypothetical outcome spreads) on lots of little colorful bits (values associated with these sorts of outcomes’ real-world effects). Let’s call the colorful bits red for security and blue for liberty. Here’s the catch: It’s not all red or all blue bits inside each black-and-white wrapper, and there aren’t the same number of bits as there are bodies being counted in each cell.
In the Schoolhouse Rock (cartoon) version, rotoscoped like all my fantasies by Richard Linklater, the numbers unwrap and the colors come flowing out. Not just red and blue, but also yellow (efficiency), green (how easily this thing can be gamed by bad actors), and other colorful bits. Different colors migrate from across confusion matrix cells into different amalgamations — their respective, monochromatic blobs outside the table, in “the real world.” Some of the colorful bits are flashing to indicate uncertainty. This image shows why, if you wanted to calculate “net costs (or benefits) to liberty (or security),” it wouldn’t be the false positive or negative count alone. Maybe you don’t buy that from this imagining and need case studies to follow out the chains of logic and evidence; fine, but that’s a different exercise. I’m just showing what better mapping would look like, in a crude daydream, in theory.
Next, we could relax some of the unrealistically generous assumptions in the usual frequency-count estimation of such programs’ consequences, one by one, and see the effects on the visualization (now an interactive simulation): binary outcomes might become continuous, with a new dimension (perhaps shading) for uncertainty. I don’t know how to envision different forms of uncertainty; I don’t even know off the top of my head what the different forms of uncertainty to envision here are. That’s probably important to figure out.
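I can’t build that interactive simulation here, but a crude sweep over two of the generous assumptions, the base rate and the screening accuracy, hints at what it would show. This is an illustrative sketch: the grid values are ones I picked, not estimates for any real program, and “accuracy” is simplified to mean both sensitivity and specificity.

```python
# Sweep the base rate and the screening accuracy, and watch how many innocent people
# get flagged for each real case caught. "Accuracy" here is both sensitivity and specificity.
population = 10_000

print("real cases | accuracy | innocents flagged per case caught")
for cases in (10, 5, 1):                      # fewer real cases = lower prevalence
    for accuracy in (0.80, 0.90, 0.99):
        caught = cases * accuracy
        flagged_innocents = (population - cases) * (1 - accuracy)
        print(f"{cases:10d} | {accuracy:8.2f} | {flagged_innocents / caught:10.0f}")
```

Even a wildly optimistic 99%-accurate screening flags on the order of ten innocent people per real case when there are 10 real cases in 10,000, and on the order of a hundred when there is only 1; push the accuracy back down toward realistic levels and the ratios climb into the hundreds.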
In another episode, lab-grown accuracy figures appear as Wile E. Coyote falling off a cliff after being chased by a Road Runner flock of reality. A diverse flock, since greater real-world diversity often limits the generalizability of lab results, causing highly accurate AI trained on predominantly white male university math and computer science students (for instance) to be not so highly accurate in the real world, after all. In the next scene, Wile E. is being chased by the environment itself — the alternating driving wind, brutal hail, and blistering sun of savagely different real-world conditions chasing him off the cliff again. And in the next, we see that he falls again and again, for eternity, like Sisyphus. We also see, on closer examination, that much of the time we do not have the tools to determine where he fell. Inferential uncertainty follows him off the cliff. The problem of validation encompasses the closing credits, which scroll over the cartoon circle reading not “The End,” but “We Don’t Know.”
Then the cartoon begins again, bracketing our immense ignorance and scientists’ frequent powerlessness to resolve it in the way many policymakers think we can, if we just “nerd harder.” It reverts, too, to a pretend binary-outcomes world, to free up shading for illustration purposes. Forget security and liberty. Let’s just turn briefly to real-world outcomes, costs and benefits — looking for some set of empirical measures that do correspond to the hypothetical math of these sorts of programs.
We Are Clueless and Science Is Hard
Usually, people designing research under conditions of inferential uncertainty look for their keys under the proverbial lamppost. That might mean, for instance, counting up the confirmed positives in mass security screenings and calling it a day. This is bad logic, bad science, and bad policy-making that harms society in all the ways I’ve already described.
There is a better way to do research. It starts with clear quantitative goals. There is hope for inference even under conditions of inferential uncertainty; there is no hope for escaping math. So we should be spending our limited resources working on difficult problems by trying to do better science on them, and not by trying to escape math. This means banning mass security screenings for low-prevalence problems, because they backfire. And turning to the harder problem of learning more about what works, and doing more of that, better.