Chat Control is coming for your dick pics. Also, possibly, for your children.
Allow me to explain. Imagine if, every time you were using telecommunications — the phone, email, messenger, chats, or videoconferencing — government agencies required the companies running the infrastructure to scan your communications for sexually inappropriate material involving minors. And companies were required to report hits to the police on the basis of some algorithm that analysts don't understand well. This is the promise of Chat Control, a new AI-driven scanning regime the European Commission proposed in May.
Kiddie porn has a very low prevalence. But the proposed screening is a mass one: essentially all digital communications. This makes Chat Control the latest example of mass screenings for low-prevalence problems. When these screenings take place in the security realm, they're called mass surveillance.
Probability theory dooms these programs to fail even if the proposed screenings have very high accuracy rates. The reason is that the base rates of the problems they’re after — espionage, terrorism, documentation of child sexual exploitation — are very low. The problems are rare, so even highly accurate mass screenings will generate overwhelmingly bad leads, or false positives. This insight is an application of the mathematical theorem called Bayes’ Rule.
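To see the mechanics, here is a minimal sketch of Bayes' Rule applied to a generic screening. Every input in it (the base rate, the detection rate, the false positive rate) is a placeholder I made up for illustration, not a figure from any real program:

```python
# Bayes' Rule for a screening test: what fraction of positives are real?
# All three inputs are illustrative placeholders, not figures from any real program.
prevalence = 1 / 10_000      # base rate: 1 in 10,000 items screened is a true case
sensitivity = 0.90           # P(flagged | true case)
false_positive_rate = 0.01   # P(flagged | not a case)

p_flagged = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_case_given_flag = sensitivity * prevalence / p_flagged

print(f"P(true case | flagged) = {p_case_given_flag:.2%}")
# ~0.89%: a 90%-sensitive, 99%-specific screen still yields >99% bad leads
# when the thing being screened for is this rare.
```

Swap in whatever numbers you like; as long as the base rate is tiny compared to the false positive rate, the flags are overwhelmingly false.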
You can't MacGyver your way out of math. The Reverend Bayes comes for us all. Like death and taxes, he comes in many forms. But somehow, no one ever sees this one coming. Security researchers keep designing mass surveillance programs. Defense interests keep funding them.
That probability theory says these programs backfire — hurting security — is a revelation every time. Even some experts who understand this fact one minute seem to lose sight of it the next. Human beings have poor Bayesian intuitions. We have to constantly check ourselves against the logic that statistical training teaches. Most people don’t.
This isn’t just about security screenings. There are many other, related examples, like medical tests. And I’ll talk about them in another post.
This post explains why mass surveillance hurts security, why Chat Control threatens to do that in particularly egregious ways, and why even people who have done some of the best empirical work on it in the public sphere may not have fully understood the implications of probability theory in this context.
Scientists Saved Themselves
To understand this story as a repeating one, let's first go back in time to the Wen Ho Lee spy scare. It was 1999 when the Taiwanese orphan, who had risen from rags to rockets — conducting nuclear explosion simulations at Los Alamos — was polygraphed, accused of passing nuclear secrets to the Chinese, and indicted. Ultimately he would cop a plea to mishandling restricted data, and receive a $1.6 million settlement for the government having mishandled his data in turn. The feds liked how the polygraph had worked so much that they proposed a new policy: Polygraph all National Lab employees.
The Lab scientists said no: Such a program would harm security by generating an excessively large number of false positives. One senior scientist who resisted was forced to resign. But so valuable are National Lab employees as a group to the state that they were able to get Congress to ask the National Academy of Sciences (NAS) to look into the science behind the polygraph and lie detection, the better for Congress to consider the Department of Energy’s proposed polygraph policy.
NAS tapped public statistician par excellence Steve Fienberg to co-chair the committee. Fienberg, who had no preconceptions about the subject matter, applied Bayes’ Rule to make this table at the heart of the report. He presented it, along with the rest of his testimony to Congress, to show why the DOE’s proposed mass screening of scientists for the low-prevalence problem of espionage would be a bad idea.
This table shows polygraph screening of all scientists at the National Labs would have hurt security by generating an excessively large number of false positives while missing some spies. In this hypothetical, there's a 99.5% probability (1598/1606) that a person classified as deceptive was really telling the truth. There's almost a 16% chance (1598/9990) an innocent person fails, but 20% of the time, a spy passes. There are almost 200 innocent people failing for each spy caught (1598 false positives / 8 spies).
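If you want to check the arithmetic yourself, here is a small sketch that recomputes the quoted figures directly from the counts in the hypothetical table (10,000 employees, 10 spies, 8 spies caught, 1,598 innocent people flagged):

```python
# Reproduce the quoted figures from the hypothetical NAS-style table:
# 10,000 employees, 10 of them spies, the polygraph catches 8 of the 10,
# and falsely flags 1,598 of the 9,990 innocent employees.
employees, spies = 10_000, 10
true_positives = 8                          # spies who fail the polygraph
false_negatives = spies - true_positives    # spies who pass (2)
false_positives = 1_598                     # innocent employees who fail
innocents = employees - spies               # 9,990
flagged = true_positives + false_positives  # 1,606 people "classified deceptive"

print(f"P(innocent | flagged)     = {false_positives / flagged:.1%}")        # ~99.5%
print(f"P(flagged | innocent)     = {false_positives / innocents:.1%}")      # ~16.0%
print(f"P(passes | spy)           = {false_negatives / spies:.0%}")          # 20%
print(f"innocents flagged per spy = {false_positives / true_positives:.0f}") # ~200
```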
Fienberg told me how this table changed minds at DOE. They didn't want to back down. They didn't extend him much courtesy in letting him know they were going to, either. But they knew, when they saw his submitted testimony, that they couldn't beat the math. And they scrapped their plan for mass screening.
But not before first ignoring, for six months, the Congressionally mandated NAS polygraph report to which they were legally required to respond. And then reiterating their planned mass polygraph screening policy in the Federal Register on the response deadline, prompting the Senate to schedule hearings. And not before Fienberg had submitted his Congressional testimony, based on his report and on how DOE's response ignored it, and gotten on a plane to DC to testify. It was there, before take-off, that he got a call from the Deputy Energy Secretary, Kyle McSlarrow, saying DOE had changed its mind. Fienberg's planned testimony went out the window. He had made a difference; but he would have to prepare different remarks, and improvise before Congress in response to what turned out to be continued changes in DOE's story.
The scientists had saved themselves.
Others Weren’t So Lucky
Fienberg regularly spoke out against next-generation "lie detection" programs like FAST (Future Attribute Screening Technology) and SPOT (Screening of Passengers by Observation Techniques) before his death. These program names don't mean anything to most people now, because they were stupid programs that died years ago. Probably. The news has been quiet about them for years at least, and DHS doesn't put out a press release when this happens. If they did, it would be entitled "We Tried Escaping Math: It's Still Not Happening." And there's still plenty of such trying going on, e.g., targeting migrants to the EU with AI.
Fienberg didn't take on federal security agencies over polygraphs. Employees at places like the National Reconnaissance Office or Central Intelligence Agency live in the dragnet of these programs. FBI documents, among others I obtained as part of my dissertation research, recount racist and other abuse; but federal agencies consider their security processes exempt from EO law. Breaking the law at home is great training for taking it on the road. A survey of American polygraphers who conducted dragnet polygraph screenings as part of the war on terror in Iraq and Afghanistan showed that they themselves didn't much like what they were doing on the ground.
These were small fish. Fienberg was already battling cancer when Ed Snowden showed the world the U.S. was conducting mass telecom surveillance. It might have been rightfully his leap to make — scaling up the NAS table to entire populations, as I did in 2016 — and it just might not have been his time.
Or he might have been too wise to pick such fights. I once tried getting racial bias research approved in Singapore, where he had a post. His colleagues there said it was impossible for reasons that made no sense, and he acted like he didn’t understand what it took me a while to get, being thick in these ways: Of course you can’t do racial bias research in Singapore. Maybe it was also an “of course you don’t take on the entire U.S. surveillance state” — even if probability theory says it’s hurting security. When the Devil lets rip a cosmic fart, pity the fool who opens a window.
Tact, especially around unspoken social rules — not my department. But my statistical training was decent. We all fumble forward in the world the best way we know how, and this bit of logic is something I can get my head around and keep revisiting. It’s a wonder of the world if you understand how it works. Sometimes I do, sometimes I don’t.
Apparently, most people don’t. Because mass surveillance programs for low-prevalence problems just keep coming.
Whack-A-Mole with Stupid Security
Here in Europe, iBorderCtrl — a Horizon 2020-funded bullshit AI "lie detector" — recently died not-so-quietly in an avalanche of righteous ridicule and outcry.
Back in the States, Night Fury — a Department of Homeland Security project to give social media users risk scores for being a terrorist — was killed quietly after the Inspector General received information on potential privacy violations, recently released documents reveal.
Next up: Chat Control. The intent is good. It's an effort, driven by the Cybercrime Centre and Contact Point of North Rhine-Westphalia (ZAC NRW) and others, to see if AI mass surveillance of chat communications on some major platforms can flag child sexual abuse.
Probability theory remains impervious to technology. The math is still the same. And the experts don’t seem to fully understand the implications. The bottom line is that these programs — mass surveillance for low-prevalence problems — generate excessively large numbers of false positives even when their accuracy is quite high. This undermines the security they’re intended to promote by draining finite resources, and hurting a lot of innocent people in the process. When they target minors, those people may disproportionately be minors.
Expert Understanding of the Problem with Chat Control…
German law enforcement expert Markus Hartmann (Generalstaatsanwaltschaft Köln) is the first security expert I've heard call out the false-positive problem (h/t Jesper Lund). From his prepared statement to the German Parliament (Bundestag) hearing on Chat Control (news coverage, video, English translation):
The main challenge when using automated detection is carefully balancing the detection focus. The tool used should miss as little illicit content as possible (false negatives) while wrongly identifying as little legally unobjectionable content as possible (false positives). These two objectives clash with each other. If the focus is minimising false positives, there is a danger that significant quantities of illicit communications would not be detected. Conversely, maximising the detection rate of genuinely illicit content would inevitably result in a higher false positive rate. The risk of innocent members of the public coming under suspicion due to false positives produced by automated detection is therefore not a static calculation. The AI tool developed by the Cybercrime Centre and its partners, AIRA (AI-enabled rapid assessment), currently detects more than 90% of the relevant illicit content with a false positive rate in the mid to low single-digit percentage range. If these figures are applied to the detection order process and the large amounts of content which would be processed, there is a significant risk that innocent members of the public would be subject to official investigations. This is particularly true with regard to the AI-based miscategorisation of cases where the visual material itself is detected accurately, but the situation under criminal law is misjudged. To give an example, this includes cases where children below the age of criminal responsibility have posted material themselves, or communications between young people in consensual contexts (see section 184c (4) of the Criminal Code).
False positives ultimately represent a misallocation of resources for the investigating authorities, as in fact no initial suspicion exists that an offence has been committed… it remains questionable whether individual redress can be an adequate corrective to any misuse of detection orders. It thus falls to providers to guarantee their users’ rights, a role which they are hardly in a position to fulfil properly, in view of their primarily commercial interests. The introduction of a strong, independent oversight mechanism is highly advisable in this context.
— Markus Hartmann, The Cologne Prosecutor-General’s Office, The Cybercrime Centre and Contact Point of North Rhine-Westphalia – ZAC NRW, Statement for the public hearing held by the German Bundestag’s Committee on Digital Affairs on the subject of “chat control” on 1 March 2023.
In other words, Chat Control would pass around a bunch of adults' and minors' own personal, explicit photos in the process of potentially screwing up their lives and taking finite resources away from actually protecting children and youth from sexual abuse. Hartmann deserves kudos for grappling with the problem.
… But the Core Misunderstood
At the same time, Hartmann’s core empirical reference here — to a more than 90% accuracy rate and a low to mid single-digit false positive rate — is misleading.
Firstly, almost all of empirical science, including in security, is inductive, not deductive. It involves probabilistic reasoning. We're never really sure we know what's going on. This problem is especially intractable in criminal contexts, where getting at the "ground truth" — like whether someone is lying or telling the truth — is often impossible.
Usually what happens with AI is, researchers train their pretty, shiny new algorithm on training data. They tout its high accuracy. And then that accuracy tanks when it goes out in the field, where life is a lot messier. Except in many cases, we can’t even measure “ground truth” that well to tell just how bad the accuracy gets.
In this case, no one is going to be able to check what is essentially all Internet traffic against algorithmic results to give some verified accuracy and false positive figures when this is used on the ground. We would need sampling to know well enough, and then there would still be uncertainties here. So Chat Control may well have 90% accuracy in the lab but 80% accuracy in the field, and it’s not clear how we would find this out, or that it’s been assessed for this kind of typical generalizability problem with this kind of tech. How would we already know this field accuracy figure, unless someone had been doing illegal mass surveillance?
This is why we need to ask whether this accuracy figure is lab or field, how it might change on the ground if it’s not there already, and what the privacy protections for the people in the sample used to assess this so far look like. Particularly because, if the sample is realistic at all, it includes minors’ private, consensual sexual communications. Did they consent to their use in this research?
Secondly, it’s not clear how false-positive rate is defined. But the accepted definition looks like this…
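For reference, the textbook confusion-matrix definitions (my rendering, not an excerpt from Hartmann's statement or the Commission's proposal) are:

\[
\text{detection rate (sensitivity)} = P(\text{flagged} \mid \text{illicit}) = \frac{TP}{TP + FN},
\qquad
\text{false positive rate} = P(\text{flagged} \mid \text{legal}) = \frac{FP}{FP + TN}.
\]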
Usually "accuracy" here means the true positive rate (how much of the genuinely illicit content gets caught), and "false positives" means bad leads. Notice in the previous tables that even high-accuracy screenings for low-prevalence problems generate overwhelmingly false positive results. Thus, Hartmann's reference to a single-digit false positive rate may have been mistaken; it was certainly misleading. Because we need to look at absolute numbers of outcomes, not percentages, to grasp the implications of probability theory here. Even a single-digit false positive rate applied to essentially all telecommunications is a very big number.
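Here is a back-of-the-envelope sketch of that arithmetic. The scanning volume and the base rate are placeholders I invented purely for illustration; only the detection rate and the false positive rate echo the ranges Hartmann cites:

```python
# Back-of-the-envelope scale illustration. The volume and base rate are invented
# placeholders; the error rates echo the ranges in Hartmann's testimony.
messages_scanned_per_day = 10_000_000_000   # assume 10 billion messages scanned daily
base_rate = 1 / 100_000                     # assume 1 in 100,000 messages is truly illicit
detection_rate = 0.90                       # "more than 90%" detection
false_positive_rate = 0.03                  # "mid to low single-digit" false positive rate

illicit = messages_scanned_per_day * base_rate
legal = messages_scanned_per_day - illicit

true_positives = detection_rate * illicit
false_positives = false_positive_rate * legal

print(f"True positives per day:  {true_positives:,.0f}")    # ~90,000
print(f"False positives per day: {false_positives:,.0f}")   # ~300,000,000
print(f"Share of flags that are bad leads: "
      f"{false_positives / (false_positives + true_positives):.2%}")  # ~99.97%
```

On those made-up inputs, a "mid to low single-digit" false positive rate becomes roughly 300 million flagged innocent messages a day, and more than 99.9% of flags are bad leads.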
That's the reason why, thirdly, what we really want to know isn't a false positive rate. It's the conditional probability of the thing we want, conditional on a positive test. Counting up outcomes in all four cells of the confusion matrix above helps people see this without needing any formal probability training. In the Labs example, it's the difference between saying there's a 16% chance (1598/9990) of an innocent person failing, and saying there's a 99.5% probability (1598/1606) that a person classified as deceptive was really truthful. Both are correct descriptions of the same spread of outcomes.
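In symbols, using the Labs counts from above, the two statements are:

\[
P(\text{flagged} \mid \text{innocent}) = \frac{FP}{FP + TN} = \frac{1598}{9990} \approx 16\%,
\qquad
P(\text{innocent} \mid \text{flagged}) = \frac{FP}{FP + TP} = \frac{1598}{1606} \approx 99.5\%.
\]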
Finally, it's not clear what false positives are in human terms. What proportion of falsely flagged communications are adult sexting? What proportion are consenting sexual communications between minors? And what happens to these people — adults and minors — when they're flagged, other than their currently encrypted communications losing that protection under Chat Control, and being circulated among adults, ostensibly for protective evaluation purposes? Are they then subjected to potentially traumatic investigations? What happens to their parents when this happens to minors? And what's the basis for expecting the benefit to minors' well-being from the small number of genuine child abuse investigations Chat Control initiates to outweigh the harm from the large number of false-positive investigations to which minors may be subjected as a result?
More Numbers Confusion
Pirate MEP and transparency activist Patrick Breyer has an excellent breakdown of what the new Chat Control proposal entails. While his overall work is stellar as usual, Breyer seems to contribute to some related numbers confusion here:
According to the Swiss Federal Police, 80% of the reports they receive (usually based on the method of hashing) are criminally irrelevant. Similarly in Ireland only 20% of NCMEC reports received in 2020 were confirmed as actual “child abuse material”.
These numbers seem to be taken to fit together because 80% and 20% sum to 100%. But remember there are four cells in the confusion matrix, not two, and these two figures come from different countries, different years, and different reporting pipelines. In the Labs example, mass screening that caught 80% of spies in a sample of 10,000 with 10 spies produced a 99.5% probability that someone classified as a baddie was really ok, falsely implicated an innocent person almost 16% of the time, and missed a baddie 20% of the time. Here, the Swiss 80% keys into the false positives (the 1,598 in the Labs table) and the Irish 20% into the true positives (the 8 in the table), each as a share of that country's reports. Percentages of false and true positives pulled from two different systems have no reason to sum to 100%.
Maybe it’s a coincidence that, in this case, they do. Maybe it shows they’re wrong; I don’t know. It definitely doesn’t validate them, though. They’re not showing an expected consistency by happening to sum to 100%.
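One way to keep the figures straight: both are statements about the composition of reports, the same kind of quantity as 1598/1606 in the Labs table, not false positive rates over everything scanned. A small sketch, under my simplifying assumption that the published percentages are exact and unresolved reports can be ignored:

```python
# Read the Swiss and Irish figures as shares of *reports* (flagged items),
# the same kind of quantity as 1598/1606 in the Labs table, not as false
# positive rates over everything scanned.
# Assumption (mine): the published percentages are exact, unresolved reports ignored.
labs_false_positives, labs_true_positives = 1598, 8

labs_share_bad_leads = labs_false_positives / (labs_false_positives + labs_true_positives)
swiss_share_bad_leads = 0.80   # reports the Swiss Federal Police found criminally irrelevant
irish_share_good_leads = 0.20  # NCMEC reports Ireland confirmed as abuse material in 2020

print(f"Labs:    {labs_share_bad_leads:.1%} of flags were bad leads")   # ~99.5%
print(f"Swiss:   {swiss_share_bad_leads:.0%} bad leads among reports")
print(f"Ireland: {irish_share_good_leads:.0%} good leads among reports")
# The Swiss and Irish numbers come from different countries, years, and reporting
# pipelines, so the fact that 80% and 20% sum to 100% is a coincidence, not a
# consistency check.
```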
But they do suggest a real-world comparison between possible policy universes. It would be very interesting if the share of bad leads had dropped from 80% in Swiss Federal Police experience, pre-Chat Control, to the low-to-mid single digits with it, as a quick reading of Hartmann's testimony might suggest (though his single-digit figure is a false positive rate, not a share of reports, so the two aren't directly comparable). That would suggest a possible harm-prevention case for using the algorithm, if the real-world alternative is that police use vastly inferior hashing. Maybe we aren't choosing between Chat Control and no Chat Control, but between better or worse mass screening of the sort.
Except Chat Control as currently proposed isn’t about police using an algorithm. It includes client-side scanning, in which “Providers of end-to-end encrypted communications services will have to scan messages on every smartphone… and, in case of a hit, report the message to the police.”
Breyer calls this, along with the more general obligation of "providers to search all private chats, messages, and emails automatically for suspicious content…", "the end of privacy of digital correspondence." It would be a big deal to effectively lose the widespread, routine end-to-end encryption that providers instituted post-Snowden. It reflects a massive possible expansion of state power. And of the raw numbers of false positives generated through such communications scanning. And because it targets minors, it threatens minors.
Politics and Probability Theory
We don’t control the full institutional structure, and are constantly compromising to advance security and liberty. Probably a ban on mass security screenings for low-prevalence problems is the best way to achieve both those goals. But it may not be feasible, particularly as few people seem to understand why these programs are doomed to backfire.
Security and liberty aren’t innately opposed. There is no mass security screening for a low-prevalence problem so accurate that we should want to accept it in the interests of security. New technology isn’t about to create one by somehow beating probability theory. It does not and will not exist. No matter how clever you are, no matter how good your tech is, there’s no escaping math.
Discussions of Chat Control so far seem to be missing this point, along with the crucial piece of information we need to show how these implications of probability theory play out in this case: a base rate. Roughly how often are child sexual abusers using the screened media, like encrypted chats, for that purpose, relative to all the communications that have nothing to do with it? How do we know? If we don't know, how should we model the uncertainty?
This figure is central to the transparent calculations that should underpin democratic discussions about any form of mass surveillance. Without it, we should be skeptical of all circulated Chat Control numbers. Because without it, we can’t do the math to fill out the whole confusion matrix, to see the complete accuracy picture.
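One honest way to handle the missing number is to compute the answer across a range of assumed base rates and show how sensitive it is. The sketch below reuses Hartmann's stated detection rate and assumes a 3% false positive rate; every base rate in the sweep is a placeholder, not an estimate:

```python
# Sweep assumed base rates to see how the share of good leads (precision) responds.
# Detection rate echoes Hartmann's testimony; the 3% false positive rate and all
# base rates below are placeholders, not estimates of real prevalence.
detection_rate = 0.90
false_positive_rate = 0.03

for base_rate in (1e-3, 1e-4, 1e-5, 1e-6):
    true_pos = detection_rate * base_rate
    false_pos = false_positive_rate * (1 - base_rate)
    precision = true_pos / (true_pos + false_pos)
    print(f"assumed base rate {base_rate:.0e}: {precision:.3%} of flags are true hits")
```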
We have to count the bodies. And right now, we can’t.
Maybe mass surveillance for low-prevalence problems is good enough for Europe’s children. But it wasn’t good enough for scientists at the National Labs. And the two groups are equal in the eyes of probability theory. Are they equal in the eyes of the law?