Could Polygraphs Save Lives?
My dissertation suggested they reduced police brutality; here’s why I never trusted the results, but they could inform future research
What if a controversial, widely criticized screening tool actually reduces police brutality? My difference-in-differences analysis suggested it might. Here’s why I never trusted it — but it could still contribute to important research.
Rules of the Game
“Try it on the ground first. Then you can think about climbing the tower with your flying machine.” That’s the rule of thumb for finding out if something might work if you’re an aeronautical engineer.
“Run an experiment in the lab and a quasi-experiment in the field. Then you can think about doing a field experiment.” That was the rule of thumb for estimating causal effects (i.e., finding out if something worked) when I was in grad school. (Apparently no one in social science methods at U.Va. in the 2010s had heard of the causal revolution. Ask me about the bygone days of my misspent youth.)
So that’s what I did.
My dissertation research was ambitious and painstaking: videotaped interviews, a by-mail survey, seven online survey experiments, two lab-based psychophysiology studies, and an extensive set of difference-in-differences (DID) analyses on nationwide police (LEMAS) data. (My committee had me leave a bunch of stuff out of the defended version to make a coherent whole. Following that advice, instead of compiling and archiving everything in one big online appendix, was my mistake.)
And yet, when I defended, I knew it wasn’t done. It didn’t feel right. I had gotten a field experiment approved and not run it. It was underpowered. In retrospect, it was also trying to answer the wrong questions.
I still can’t answer the questions about racial and confirmation bias in polygraphy that the NSF funded me to answer. At least, not in the definitive way I thought it was my job to answer them. (Which turns out to be the wrong approach to answering these and many other questions.)
Those were, anyway, the wrong questions — to which others have recently offered a wrong answer. My answer at the time was also wrong. To be fair, Stephen Fienberg and the National Academy of Sciences also got the wrong answer, albeit to something more closely resembling the right question.
All the quantitative methods I used were hot, especially the econometrics. (I still can’t mention them without people smarter than me responding that this is above their pay grade.) So why didn’t I publish my findings?
I DID It: Seven Methodological Sins of a Repentant Econometrician
Other people have written about why DID is great, a case I will happily let them continue arguing. I am only interested in why my own analyses were, in my view, too flawed to publish. Here are their seven cardinal methodological sins:
SUTVA trouble: the most common and serious DID sin
The most obvious violation of the stable unit treatment value assumption (SUTVA) here is that different departments use different polygraphs differently, so the “treatment” is not really one treatment. Differences in hardware, software, analog versus digital scoring practices, control question versus relevant/irrelevant test structures, and who knows what else could all make a difference (or not).
Matching method limitations — collapsibility trouble
Although I wasn’t familiar with the problem at the time, the matching method I used (coarsened exact matching, or CEM) is vulnerable to sparse-data bias and to the non-collapsibility of effect measures like the odds ratio, which together can introduce more bias than the matching corrects for.
Here’s roughly how it works: as you subdivide the data into ever more finely covariate-matched strata that further predict the outcome, the stratum-specific odds ratios no longer match the overall (marginal) odds ratio even when there is no confounding, and the sparser the strata get, the noisier and more biased the estimates become. It’s like cutting a cake into smaller and smaller pieces: the odds of one slice having more frosting than another keep increasing.
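Here is a minimal simulation of the non-collapsibility piece of that problem. Everything in it is invented for illustration; none of it comes from the LEMAS data or my dissertation.

```python
# A minimal simulation of odds-ratio non-collapsibility. All numbers are invented
# for illustration; nothing here comes from the LEMAS data or my dissertation.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

treat = rng.binomial(1, 0.5, n)        # hypothetical "has a polygraph program" flag
risk_factor = rng.binomial(1, 0.5, n)  # a strong outcome predictor, independent of treatment

# True data-generating model: the SAME conditional odds ratio (2.0) in both strata.
logit = -2.0 + np.log(2.0) * treat + 3.0 * risk_factor
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

def odds_ratio(t, outcome):
    a = np.sum((t == 1) & (outcome == 1)); b = np.sum((t == 1) & (outcome == 0))
    c = np.sum((t == 0) & (outcome == 1)); d = np.sum((t == 0) & (outcome == 0))
    return (a * d) / (b * c)

print("conditional OR, stratum 0:", round(odds_ratio(treat[risk_factor == 0], y[risk_factor == 0]), 2))
print("conditional OR, stratum 1:", round(odds_ratio(treat[risk_factor == 1], y[risk_factor == 1]), 2))
print("marginal OR (collapsed):  ", round(odds_ratio(treat, y), 2))
# Expect roughly 2.0, 2.0, and about 1.5: no confounding, yet the collapsed
# estimate differs from the stratum-specific ones. Finer strata also mean
# sparser cells and noisier estimates (sparse-data bias).
```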
Unit trouble
The point of doing research is to make the world a better place. To have any hope of achieving that goal, researchers must perform analyses in units that can be interpreted in terms of practical importance, and then actually do that interpretation.
The analysis looked at log proportions of key outcomes of interest (e.g., African-Americans as log proportions of sworn full-time officers; log proportions of sustained citizen complaints of excessive use of force). These units are technically interpretable. But it’s a bit of work, and it wasn’t done.
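For instance, with made-up numbers that are not my dissertation’s estimates, a DID coefficient on a log outcome can be back-transformed into an approximate percent change, and from there into counts a police chief might actually care about:

```python
# A made-up example (not my dissertation's estimates) of translating a DID
# coefficient on a log outcome into practically interpretable terms.
import numpy as np

beta = -0.15   # hypothetical DID estimate on log(sustained excessive-force complaints)
se = 0.06      # hypothetical standard error
lo, hi = beta - 1.96 * se, beta + 1.96 * se

as_pct = lambda b: 100 * (np.exp(b) - 1)
print(f"point estimate: {as_pct(beta):+.1f}% change in sustained complaints")
print(f"95% interval:   {as_pct(lo):+.1f}% to {as_pct(hi):+.1f}%")

# For a department with, say, 20 sustained complaints over the study period,
# the point estimate would correspond to roughly:
print(f"about {20 * (np.exp(beta) - 1):+.1f} complaints")
```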
Causal trouble
No causal diagramming (DAGs), no clear causal model, and (worst of all) no recognition of the problem of selection on the dependent variables — diversity and brutality.
Descriptive statistics showed that the police departments that selected into having polygraph programs had lower black representation and worse brutality. But was that because they had bias and corruption problems — problems that introducing more police recruit selection tools, like polygraphs, could have been meant to address? Or because polygraphs somehow institutionalized racial bias and worsened police violence (e.g., by selecting on authoritarianism or some other trait that predicts it)? Or something else?
The fact is that we don’t know, and matching without trying to think it through is a recipe for bias. (I argued on p. 200 that, as long as the self-selection factor is constant over time, especially since the time period is short — 2003-2007 — this design is not vulnerable to that threat. That seems wrong.)
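To make the selection problem concrete, here is one plausible causal structure written out as a toy DAG. The structure and node names are my assumptions for illustration, not an established model:

```python
# One plausible causal structure, written out as a toy DAG. This is an assumption
# for illustration, not an established model; the node names are hypothetical.
import networkx as nx

edges = [
    ("prior_bias_and_corruption", "adopts_polygraph_program"),   # troubled departments select in
    ("prior_bias_and_corruption", "black_representation"),
    ("prior_bias_and_corruption", "excessive_force_complaints"),
    ("adopts_polygraph_program", "recruit_screening_stringency"),
    ("recruit_screening_stringency", "excessive_force_complaints"),
    ("recruit_screening_stringency", "black_representation"),
    ("department_size_and_budget", "adopts_polygraph_program"),
    ("department_size_and_budget", "excessive_force_complaints"),
]

dag = nx.DiGraph(edges)
assert nx.is_directed_acyclic_graph(dag)

# The back-door paths from program adoption to both outcomes run through
# "prior_bias_and_corruption" and "department_size_and_budget"; matching that
# ignores the first of these leaves the selection problem untouched.
print(sorted(dag.predecessors("adopts_polygraph_program")))
```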
Fairness criteria trouble: unrecognized incompatibility
This one is anachronistic, since the literature showing that common statistical fairness criteria generally cannot all be satisfied at once was only published after I defended my dissertation, but it is still a problem for bias research such as this.
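Here is a tiny numerical sketch of that incompatibility, using hypothetical numbers that have nothing to do with real polygraph accuracy or real groups:

```python
# A tiny sketch of the incompatibility, with hypothetical numbers that have
# nothing to do with real polygraph accuracy or real groups.
def ppv(base_rate, sensitivity, specificity):
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

sens, spec = 0.80, 0.80  # hold the test's error rates equal across both groups
for group, base_rate in [("group A", 0.05), ("group B", 0.20)]:
    print(f"{group}: base rate {base_rate:.0%}, "
          f"P(truly deceptive | flagged) = {ppv(base_rate, sens, spec):.1%}")
# Equal error rates plus unequal base rates force unequal predictive values,
# so the usual fairness criteria cannot all hold at once.
```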
Statistical significance trouble
Failure to report and fully interpret 95% compatibility intervals, and related misuse of statistical significance testing, characterize my whole dissertation, including these analyses. I knew not what I did; I have since retaught myself many things that I was taught wrong, and there also seems to have been a lot of recent progress in statistics and science reform.
Like #3, this one is fixable, but it wasn’t done.
These issues still bother me today, which is part of why I never published this analysis. (I bet there are even mistakes in this retrospective evaluation of the mistakes…)
That was the right thing to do, right? Unless it, too, was another mistake!
File-drawer contribution
Not publishing left a gap that others filled — incorrectly (see Whittaker et al).
My research — including results from these DID analyses, and the expertise I gained from doing it — highlights that recent claims of racial bias in polygraphy lack sufficient evidentiary basis. My results did not establish systematic racial bias. Not in polygraph chart interpretation or other tech-mediated decision contexts in multiple survey experiments on Amazon’s Mechanical Turk. Not in mock polygraph subjects’ physiological responses under stereotype threat. Not in these nation-wide police departmental polygraph (LEMAS) data. And not in licensed Virginia polygraphers’ survey responses.
At the same time, my results did not establish lack of systematic racial bias in polygraphy, either. “Absence of evidence is not evidence of absence,” in Altman’s terms. Null hypothesis significance testing contributes to the sort of misinterpretation of statistical tests that I did back in the days of the dissertation. What my findings actually suggest is uncertainty (more on this, and the data, in a future post).
But publication bias is hard to fight. Correcting the scientific record is difficult. There are too many disincentives for researchers and journal editors to admit that they made mistakes and fix them. Too many perverse incentives for quick and dirty publications.
Ironically, though — in spite of my own previous focus on possible racial bias in tech-mediated decisions including polygraphy, and the current bias bandwagon on which it appears Whittaker et al hopped without sufficient evidentiary basis — my DID findings suggest it’s possible that police preemployment polygraph screening programs reduce police brutality, one of the deadliest forms of racialized violence in American society. If we truly care about racial justice, not to mention quality in science and policing alike, we need to know if that’s true… And we could actually do better research to find out.
Life or death
Bias in polygraphy may sound like a niche topic, but what if polygraph programs have life or death consequences? My DID results suggest police polygraph programs reduce police brutality. There’s just one problem. (Well, the previously mentioned seven methodological sins plus one.)
These brutality findings run into the multiple comparisons problem, which increases the uncertainty about whether they reflect a true effect. If you run a bunch of analyses, chances are something will come back “statistically significant.” If the researcher then represents that as the Truth, it’s like a con artist doing a cold reading: say a bunch of stuff, then run with whatever lands. (Aka fraud.)
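A quick simulation on pure noise, with no real data anywhere in it, shows the scale of the problem:

```python
# Pure noise, no real data: run enough tests and some will look "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, n_per_group = 40, 100  # e.g., many outcomes crossed with many selection tools

p_values = np.array([
    stats.ttest_ind(rng.normal(size=n_per_group), rng.normal(size=n_per_group)).pvalue
    for _ in range(n_tests)
])

print("tests run:", n_tests)
print('"significant" at p < 0.05 with no true effects anywhere:', int((p_values < 0.05).sum()))
print("surviving a Bonferroni correction:", int((p_values < 0.05 / n_tests).sum()))
```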
At the time, I wasn’t sure what to do with these sorts of problems. (I’m still not completely sure.) There was a notable disconnect between the upper-echelon methods discourse and the teaching/statistics practices I observed, and it took some time to learn more and process how serious this disconnect really was. Suffice it to say, I got some bad advice and am glad that I didn’t follow it.
Now, over a decade later, I wish I had kept better records of what exactly I had planned. A few related IRB protocols mention an authoritarian selection hypothesis I tested, planned to test more, and then abandoned because it didn’t seem to be panning out. So I suspect that I had hypothesized the opposite effect from what I found here (and could dig around more trying to find a clear trail), but I am not sure. I do know that I set out to study ascriptive (racial and gender) bias first and foremost in these DID analyses, also looked at what polygraphs and other selection tools seemed to do to other interesting outcomes, and found this interesting brutality result in that process. So all I can do now is say that. This is not the right way to do science. Preregistration should be the norm.
In spite of all these flaws and uncertainties, this suggestive finding could be practically important, and it could make causal sense: instituting a polygraph program showed the largest estimated effect on sustained complaints of excessive officer use of force, and the second-largest on total such complaints (after credit checks), of any police selection tool (see dissertation, Chapter 4, Table 5).
How big? We don’t know. (See sins #3 and #6 above.)
How does it work, if it works? We don’t know. (See sins #1 and #4.)
But three mechanisms might make sense…
Revisiting causality
Police polygraph programs could conceivably reduce brutality through some combination of three causal mechanisms, all of which have long been recognized in relevant scientific literature and professional discourse:
“lie detection” (implicating Bayes’ rule as applied in the NAS report Table S-1 A and B; see the sketch after this list) — polygraph as test;
deterrence — polygraph as perceived threat; and
bogus pipeline — polygraph as interrogation tool. A bogus pipeline is a fake “lie detector.” The idea is that people confess more when they believe a lie detector works, even when it’s fake.
These mechanisms might be expected to interact in the field, and the latter two can be scientifically validated (more on this in a future post).
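To make the first mechanism concrete, here is a back-of-the-envelope Bayes’ rule calculation in the spirit of the NAS report’s Table S-1. The prevalence and accuracy figures below are illustrative assumptions, not the report’s numbers or my dissertation’s:

```python
# Back-of-the-envelope arithmetic for the "polygraph as test" mechanism, in the
# spirit of the NAS report's Table S-1. The prevalence and accuracy figures are
# illustrative assumptions, not the report's numbers or my dissertation's.
def screening(population, prevalence, sensitivity, specificity):
    liars = population * prevalence
    truthful = population - liars
    true_pos = liars * sensitivity
    false_pos = truthful * (1 - specificity)
    return true_pos, false_pos, true_pos / (true_pos + false_pos)

# Hypothetical: 10,000 recruits, 1% with a disqualifying history, and a test
# generously assumed to be 85% sensitive and 85% specific.
tp, fp, ppv = screening(10_000, 0.01, 0.85, 0.85)
print(f"true positives caught:   {tp:.0f}")
print(f"false positives flagged: {fp:.0f}")
print(f"P(flagged recruit actually has the history): {ppv:.1%}")
# At low prevalence, most flagged recruits are false positives, which is the
# core of the NAS critique of polygraph mass screening as a test.
```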
But wait, reducing police brutality can be construed as a way of net increasing security. The NAS concluded polygraphs would undermine security in the National Labs. What gives?
The NAS’s analysis omitted the latter two possible mechanisms. Maybe the addition or interaction of those mechanisms makes these programs net beneficial for organizations, improving workforce quality in measurable ways, like decreasing police brutality. That sort of effect would probably even pack a punch (pun intended) against racial bias.
If this is right, how did the Academy get it so wrong on such an important issue?
Revisiting context
We would expect the deterrence and bogus pipeline mechanisms to work better in relatively naive populations. Scientists might be more likely to know (or believe they know) that polygraphs are pseudoscience than the general population. Police recruits might be more likely to believe in polygraphs, and to believe police polygraphers who tell them that polygraphs work. So it might have made contextual sense to estimate hypothetical outcomes as NAS did in terms of the “lie detection” mechanism alone — for the scientist population.
It might also make market sense for the state to think that way. Top physicists are a much scarcer population than police or other security agency recruits. High-level intelligence community sources reportedly told NAS polygraph report co-chair Stephen Fienberg that they could afford to throw away all the false positives. Fienberg told me he didn’t believe them. Market differences could explain the discrepancy; maybe the National Labs couldn’t afford that human resources loss, but other agencies could.
In addition to these social dimensions, there was sociopolitical pressure from the Academy’s fellow scientists, who wanted to keep themselves from being subjected to polygraphs, which are notoriously unpleasant. Under conditions of relative scarcity for their skills, the scientists’ pressure would have been comparatively effective. Police recruits don’t have that kind of power.
Nor do they run the National Academy of Sciences. Perhaps the scientists on the committee sympathized with the scientists facing the polygraph. It might have been particularly easy to do so at that particular sociopolitical moment, in the wake of the Wen Ho Lee spy scandal and the then-ongoing legal conflict around related alleged abuses.
In the big picture, the scientists’ pressure responded to the DOE’s proposed polygraph expansion, which responded to the spy scandal, which security agencies feared indicated security problems that could also have life or death consequences. Some of the nation’s top scientists on one hand and security agency management on the other both argued that their diametrically opposed positions (anti and pro polygraph screening) net advanced security.
Limited by the same perspectives that informed their expertise, neither side recognized that both were succumbing to a common cognitive distortion — believing that we know something we don’t know (uncertainty aversion). In the end, whether polygraph screening net advances or degrades security remains an open empirical question. One both sides have a stake in trying to answer.
Future research
A field experiment would (still) be the gold standard for trying to measure important real-world outcomes as they relate to polygraphs. A randomized national field experiment comparing police departments that institute polygraph programs with ones that don’t on the primary measures of sustained and total citizen complaints of excessive officer use of force might produce evidence of their efficacy at increasing workforce integrity and thus community security. It might also show if these costly, controversial programs have no apparent such effects.
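For a rough sense of scale, here is how one might size such a trial, with made-up parameters rather than empirical estimates:

```python
# A rough sizing exercise with made-up parameters (not empirical estimates),
# treating whole departments as the units that are randomized and analyzed.
import numpy as np
from statsmodels.stats.power import TTestIndPower

baseline_mean = 4.0  # hypothetical sustained excessive-force complaints per department
between_sd = 3.0     # hypothetical standard deviation across departments
reduction = 0.25     # hoped-for 25% relative reduction

effect_size = (baseline_mean * reduction) / between_sd  # Cohen's d of about 0.33

n_per_arm = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"departments needed per arm under these assumptions: {int(np.ceil(n_per_arm))}")
```

Making the department the unit of randomization and analysis sidesteps the clustering of officers within departments; a more realistic design would also have to handle staggered adoption and baseline covariates.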
Polygraphers could conceivably grow their industry and prestige by expanding police polygraphy from a piecemeal affair — required in some states, avoided or outright outlawed in others — to a nationwide best practice. Police reformers could conceivably make headway decreasing brutality if that effect panned out — or building a stronger case against polygraphs as unscientific if it didn’t.
Polygraph proponents and their critics alike — including statisticians and others interested in this sort of program as a broader class — could learn about whether and how human behavior in this context seems to limit what we think we know about the implications of Bayes’ rule for mass screenings for low-prevalence problems. Maybe some such programs make sense after all, because people are more complicated than math. Or maybe they don’t, because no one is above the (mathematical) law.
To know how it works, if it works, we would need to build in some ways to study the possible deterrence and bogus pipeline mechanisms. This would be methodologically and ethically complex. For instance, a field experiment that collected confidential recruit self-reports on criminal history using established survey research techniques for eliciting sensitive information would have to be carefully designed to avoid creating legal vulnerability for its subjects. Randomized response offers an established way to approach solving this problem in sensitive survey research.
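Here is a simulated sketch of a forced-response variant of randomized response, with hypothetical parameters. The point is that prevalence can be estimated at the group level without the researcher ever learning any individual recruit’s true answer:

```python
# A simulated forced-response variant of randomized response, with hypothetical
# parameters. The researcher never sees any individual's true answer, yet can
# still estimate the group-level prevalence of the sensitive trait.
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
true_prevalence = 0.12                    # unknown to the researcher in real life
truth = rng.random(n) < true_prevalence

roll = rng.integers(1, 7, n)              # each recruit privately rolls a die
answer = np.where(roll == 1, True, np.where(roll == 6, False, truth))
# roll == 1 -> forced "yes"; roll == 6 -> forced "no"; otherwise answer truthfully

p_truthful, p_forced_yes = 4 / 6, 1 / 6
estimate = (answer.mean() - p_forced_yes) / p_truthful
print(f"observed 'yes' rate:  {answer.mean():.3f}")
print(f"estimated prevalence: {estimate:.3f} (truth in this simulation: {true_prevalence})")
```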
Similarly, one might think that we would want to match polygrapher reports of relevant polygraph test outcomes versus polygraph interrogation criminal confessions with hiring outcomes in order to be able to assess what mechanism is at work — but no IRB would ever approve this, for good reason. Other methods would have to be used to assess causality in a way that protected research subjects. Ironically, security agencies may have something to learn from survey researchers here about eliciting sensitive information — and scientists probably have something to learn from polygraphers on the same topic.
There’s just one more problem: If polygraphs work, but only because the people taking them aren’t always so sure they don’t — then, once people know that, they might not work anymore. Maybe the NAS was even engaged in some weird sort of Kabuki theater in its misplaced focus on the polygraph-as-test causal mechanism. (But Stephen Fienberg gave me this 2009 interview about his work co-chairing the polygraph report committee, and that didn’t seem to be the case.)