"Trust Marker" Trouble
Recent reflections on how to tackle the science crisis give cause for concern
At the 8th World Conference on Research Integrity in Athens, Greece, this week, leading science integrity researcher and activist Elisabeth Bik posted photos of broken women’s toilet seats and beautiful flower arrangements, prompting an outsider with a selfish interest in adequate ladies’ bathrooms to do the math and wonder why no one thought to fix the johns instead of gilding the lilies. She also tweeted highlights from the proceedings, including the publisher’s perspective offered by Dr. Liz Allen (F1000, Taylor & Francis).
What, Allen asked, can publishers do to make a difference in the irreproducibility crisis and in responding to fraudulent research? She answered, in part, “Build in trust-markers.”
If you care about research quality, this should scare you.
What are trust markers?
“Trust markers” are flawed proxies for the research quality signal in which scientists, publishers, and consumers of scientific literature are (ostensibly) interested. We want to know if research is true, not if researchers are trustworthy. If I’m reading your paper, I want to critically assess your process of observation, analysis, and interpretation — not whether it would be wise to let you buy me a drink.
If you have done any work on bad proxies, of which the scientific literature is chock-full, you may recognize the structure of this problem… (Regular readers, please forgive the following now-familiar rant.)
A familiar structure
This is a signal detection problem that is structurally identical to other signal detection problems (see the classic Green & Swets 1966). As I noted in a recent Chaos Computer Club talk (video; slides; related posts), the common mythology around this kind of proxy across a wide range of scientific and medical contexts is that better tech will solve intractable problems (crime, cancer, scientific fraud). Many people think improved accuracy will drive this progress.
The reality, however, is that tech just categorizes. Perfect cues on which to categorize do exist, but mostly we’re stuck with imperfect, probabilistic associations.
This type of association implies the accuracy-error trade-off, in which binary classifications yield four types of results: true positives, false positives, true negatives, and false negatives. Maximizing the true-positive rate and minimizing the false-positive rate are in tension, and both have implications for practical outcomes of interest (e.g., security, health, research quality). If we prioritize catching more true positives, we also generate more false positives; conversely, if we prioritize minimizing false positives, we also miss more true positives.
There is no exit from this statistical reality, because we can’t escape the universal laws of mathematics. Better tech doesn’t solve it, because it doesn’t change the structure of the world; it just categorizes it. This inescapable accuracy-error trade-off is why we need to worry that mass screening programs can do massive societal damage to exactly the values they are meant to protect. Under conditions of rarity, uncertainty, and harmful secondary screening, such programs are likely to do more harm than demonstrable good, even when they appear to be highly accurate.
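To make that arithmetic concrete, here is a minimal sketch in Python. The prevalence, sensitivity, and specificity below are assumptions chosen for illustration, not measurements of any real trust-marker tool; the point is that even a screener that is right 95% of the time mostly flags innocent papers when the problem it screens for is rare.

```python
# A minimal sketch with made-up numbers: expected outcomes of mass
# screening for a rare problem with an imperfect, probabilistic proxy.

def screening_outcomes(n_papers, prevalence, sensitivity, specificity):
    """Expected true/false positives and negatives for a mass screen."""
    positives = n_papers * prevalence       # papers with a real problem
    negatives = n_papers - positives        # clean papers
    tp = positives * sensitivity            # real problems that get flagged
    fn = positives - tp                     # real problems that slip through
    fp = negatives * (1 - specificity)      # clean papers flagged anyway
    tn = negatives - fp                     # clean papers correctly passed
    return tp, fp, tn, fn

# All four parameters are assumptions for illustration, not real figures.
tp, fp, tn, fn = screening_outcomes(
    n_papers=100_000, prevalence=0.01, sensitivity=0.95, specificity=0.95
)
ppv = tp / (tp + fp)  # chance that a flagged paper actually has a problem
print(f"flagged: {tp + fp:,.0f}, true positives: {tp:,.0f}, "
      f"false positives: {fp:,.0f}, PPV: {ppv:.0%}")
# With these assumptions: ~5,900 papers flagged, ~4,950 of them false
# positives, and only about a 16% chance that any given flag is real.
```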
Trust markers at the heart of nontransparent research: a case study
So that’s the structure of the problem: Mass screenings for low-prevalence problems can incur massive net damages if we’re not careful how we make inferences from imperfect proxies.
How does this specific proxy, trust markers, work? How reliable are the results? Might they penalize dissent by letting the modal stand in for the good (i.e., bad proxying logic)?
As relayed in a previous post, in an on-record interview and follow-up messages with a trust marker proponent, I asked these questions repeatedly and did not get answers.
Recently, Leslie D. McIntosh published an article with Cynthia Hudson Vitale entitled “Safeguarding scientific integrity: A case study in examining manipulation in the peer review process” (Accountability in Research, Dec. 11, 2023). McIntosh is VP of Research Integrity at Digital Science, which acquired her start-up Ripeta, a company that developed and sold algorithms intended to “[detect] trust markers of research manuscripts.” The article said it selected its case study, a special issue, “due to irregularities detected by algorithms that analyze trust markers in published research.”
However, in conversation, McIntosh, responding for both authors, said “we do a lot of manual curation” in general, and that in this case she was manually exploring, curating, and “very little relied on the algorithm” (Dec. 15 interview). When asked whether the tool, which McIntosh reported focuses on hypothesis syntax, might disproportionately flag dissent, McIntosh did not provide a substantive response. The conversation was recorded with McIntosh’s knowledge and consent.
McIntosh & Vitale’s article centers on this case study. According to McIntosh’s on-record remarks, it seems the article inaccurately reported its case selection. That would be an invalidating error, because an analysis premised on algorithmic case selection is invalid if the case was in fact selected manually. The article should thus be retracted. Invalidating errors are the generally accepted grounds for retraction, since invalid scientific reports that remain in the literature could conceivably do harm by being mistaken for valid ones.
Beyond issues pertaining to validity in this case, the use of trust markers by publishers or other institutions to conduct mass screenings for low-prevalence problems would still be dangerous for science. It threatens to take finite resources needed for assessing and strengthening research quality and misdirect them toward generating large numbers of false positives that then need further sorting of some kind. As in other programs that share this structure, the proxies are probabilistic, the base rate of serious offenses is too low for the accuracy-error calculus to shake out, and secondary screenings to disambiguate uncertain results risk doing a lot of harm.
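One way to see why the base rate dominates here is to hold the screener’s apparent accuracy fixed and vary only how common serious problems actually are. The sensitivity and specificity below are again assumptions for illustration; as noted below, I have not been able to obtain any such figures for the actual tool.

```python
# Sketch (assumed accuracy figures): the share of flags that are real
# collapses as the base rate of serious problems falls, even with a
# fixed, seemingly impressive 95% sensitivity and 95% specificity.
sensitivity, specificity = 0.95, 0.95

for prevalence in (0.20, 0.05, 0.01, 0.001):
    flagged_real = prevalence * sensitivity               # true-positive mass
    flagged_clean = (1 - prevalence) * (1 - specificity)  # false-positive mass
    ppv = flagged_real / (flagged_real + flagged_clean)   # P(real | flagged)
    print(f"base rate {prevalence:>6.1%} -> {ppv:5.1%} of flags are real")

# base rate  20.0% -> 82.6% of flags are real
# base rate   5.0% -> 50.0% of flags are real
# base rate   1.0% -> 16.1% of flags are real
# base rate   0.1% ->  1.9% of flags are real
```

Every flag that is not real still has to be handled by someone, which is where the finite resources go.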
Correction Issued, Retraction Denied
In response to my concerns, the journal’s editor, Lisa Rasmussen, a philosophy professor at UNC Charlotte, issued a correction.
There is, however, no substantive correction in this correction. It simply changes the sentence “We selected this particular case study due to irregularities detected by algorithms that analyze trust markers in published research” to read “This case study was selected due to irregularities detected by our previously developed methodology that analyzes trust markers in published research (Sumner, Vitale, and McIntosh 2022).”
In email correspondence, Rasmussen said she could not evaluate correspondence between me and McIntosh and thus did not consider it as part of her response. She did not ask for the correspondence, or the interview recording in which McIntosh described the case selection as not relying much on the algorithm. Thus, Rasmussen apparently failed to investigate the substance of the allegation, which would have involved considering this evidence.
McIntosh’s statements suggest that both the original and the newly corrected version of the article’s case selection descriptions are incorrect.
I also pointed out, in the previous post and in email correspondence with Rasmussen, that the software McIntosh & Vitale used is proprietary and under development, making theirs non-transparent transparency software used to produce irreproducible reproducibility research, and that I had repeatedly asked for and not received any information about the tool’s accuracy, reliability, and validation.
Rasmussen responded:
We agree that transparency and validation of methods and materials are essential for reproducible and rigorous research. In this case, however, the method (i.e., the AI tool) was used only for selecting the case to be examined, not for data analysis itself, so we consider it to be acceptable that the authors did not initially include information about how the AI was validated and that the AI is itself proprietary (April 15, 2024 email).
As social documents, McIntosh & Vitale’s invalid research — and Rasmussen’s failure to investigate and retract it as such — show how power shapes scientific discourse. McIntosh & Vitale attacked dissenting abortion researchers. This attack supports the story that powerful social and political networks would like to be true.
Thus, though invalid, McIntosh & Vitale’s work stands in the scientific record. As I wrote previously:
Research in which nontransparent algorithmic processes from which researchers stand to profit seem to have been used post hoc to both smear scientists as unethical without actually evaluating the merits of their work, and to promote those algorithmic processes as objective, neutral, and transparent, when evidence is insufficient to establish them as any of those things.
Bias in Research Integrity
By contrast, a few months ago there was a volley of retractions of abortion research that dissented from the consensus pro-choice narrative. According to the affected researchers, no invalidating errors were reported. (More on this in a future post.)
Retraction is rare, though mistakes (even invalidating ones) in the scientific literature are common. As obesity researcher David B. Allison and colleagues put it in a Nature comment, “Mistakes in peer-reviewed papers are easy to find but hard to fix.”
So what does (and what does not) get fixed is not random. Rather, it probably reflects social and political influences. And that, in turn, biases research integrity work. As Allen reflected:
discoverability is key — are we uncovering tip of the iceberg? (legacy vs new issues)
are we addressing the root causes or currently intervening where it is expedient?
Or: Why recognize the intractable problems of human stupidity and perverse incentives, when you can make money and get publications (that might help you make more money) intervening where people want you to intervene?