Birthday Trouble
No anonymization protocol should include actual birthdays
“You seem angry,” the psychotherapist at the women’s NGO said.
One might be forgiven for anger at all that is wrong in the world. But this was not, as it turns out, a deep psychological insight. She had started out by asking for my date of birth to “anonymize” data for invoicing. Not invoicing to me. Invoicing to someone else, presumably a funder. Someone else who would pay for her services using an invoice with my date of birth.
No, you can’t use my date of birth to “anonymize” your data. And yes, you’re going to get a lesson now in anonymization, weak pseudonymization, randomization, threat modeling, operational memory, and why you should never, ever invite a statistician to your dinner party (and apparently also not to your NGO).
Sensitive Data
Researchers and organizations that work with vulnerable subgroups — such as minors, prisoners, survivors of domestic and sexual abuse, and targeted minorities — have a special responsibility to do no harm. So indeed do all medical professionals, if they want to uphold the Hippocratic Oath. So do researchers according to the Nuremberg Code.
That responsibility includes keeping data secure to protect privacy. The best way to keep identifying data secure is to not have it. The rule is: if you don’t need it, don’t collect it.
Names are identifying data. Dates of birth are identifying data. Names plus birthdays in combination are among the strongest quasi-identifiers we have. Both are used as part of identity verification processes by all kinds of official institutions, like banks and governments.
If you made a list of identifying data, names and dates of birth would be at the top of the list. And then you wouldn’t use pieces of either of them, least of all in combination, to “anonymize” data. You would burn that list, tell no one that you had ever made it, and make an entirely new list using no actually identifying data if you wanted to anonymize. Here’s how!
Wait, but Insurance Does It
“But this is also the protocol health insurance companies use, for example,” the psychotherapist protested.
These are different cases. Insurance companies have to link all kinds of sensitive data all the time: the payer (the insurance company) has to know who the individual (the patient, subject, or client) is in order to pay.
If you’re an NGO billing on a grant, however, you don’t have that reason to link data. In this situation, there is no operational need to link invoices to identifiable individuals. NGO clients don’t normally talk to NGO funders. Rather, in the NGO case, there are operational needs to protect privacy as a matter of security.
(This is to say nothing of the fact that, if you tell people you’re anonymizing their data, then you should actually be anonymizing their data. Otherwise, you don’t have their informed consent to do what you’re doing.)
Under GDPR, as well as widespread medical practice and general research norms, the lack of operational need alone triggers the principle of data minimization:
Only process personal data that are strictly necessary for the stated purpose.
Here, identifying data are not necessary. So they shouldn’t be collected.
So it should come as no surprise that there is no justification for collecting identifying data under GDPR’s “legitimate interest” test in this case, because:
the client is not the payer (NGOs often invoice on grants)
no insurance processing occurs
there is no follow-up billing to the client
the organization’s accounting can be satisfied without identity linkage
The client’s right to privacy dominates. That means no identifiable data should be processed. And certainly not mislabeled as “anonymous.”
Anonymization versus Pseudonymization
Under GDPR, anonymized data are data which cannot be linked back to a person. Pseudonymized data, by contrast, have their identifiers replaced — but re-identification remains possible.
Technically, initial + birthday is a form of weak pseudonymization at best.
This is worse than it sounds: replacing a name with an initial does not necessarily decrease identifiability, and pairing the initial with a birthday actually increases it. In a small population, the resulting codes are often unique, which makes re-identification trivial once combined with other information like location.
In a small NGO, staff can usually infer exactly who the person is from this information alone.
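If you don’t believe the arithmetic, here is a back-of-envelope sketch with made-up but plausible numbers (26 initials, 366 possible birthdays, a caseload of 50, all my assumptions):

```python
# How identifying is initial + birthday? A birthday-problem-style estimate.
# Assumes initials are roughly uniform over 26 letters and birthdays over 366 days.
from math import prod

combos = 26 * 366   # possible initial + birthday "pseudonyms"
clients = 50        # a small NGO's caseload (illustrative)

# Probability that all 50 codes are distinct, i.e. each code
# points to exactly one person and acts as a fingerprint.
p_all_unique = prod(1 - i / combos for i in range(clients))
print(f"{combos} possible codes, P(all {clients} clients unique) = {p_all_unique:.2f}")
# prints: 9516 possible codes, P(all 50 clients unique) = 0.88
```

In other words, with a caseload of 50, there is roughly an 88% chance that every single client’s “anonymized” code is unique to them.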
This is a terrible practice that should be stopped immediately. Why?
Threat Modeling
For domestic and sexual abuse survivors, data linkage risks are real. Identifiable billing artifacts could surface in legal disputes. Abusers often smear victims as unstable, even though needing support after abuse is normal. The (accurate) perception that “anonymous” records are not really anonymous could also be harmful and deter people from getting needed help.
But actual anonymization is hard! People are really bad at randomness. We can’t generate it ourselves. And you can’t just use a random number generated from Excel, for example, because it’s not really random.
This is where people start telling spy stories. Ordinary tools like Excel (and a browser’s everyday Math.random) use ordinary pseudorandom number generators, which are not cryptographically secure. Somewhere, a mathematician at a three-letter agency is shaking her head.
That’s true but irrelevant. The threat model here is not a nation-state that wants to systematically re-identify NGO data. Could Russia do that, and blackmail future leaders with the material? Sure. But that’s not the right threat model for a women’s NGO; that’s the plot of “The Americans.”
The right threat model for a women’s NGO is accidental disclosure, casual internal inference, sloppy paperwork, and the slow accretion of linkable records. Against those very real risks, even modest randomization — a short alphanumeric ID generated once and stored internally — is a massive improvement over using initial + birthday, which is strongly identifying. You don’t need spy-grade entropy to stop leaking identity. You just need to stop intentionally using identity when you’re trying to protect identity.
How to Actually Anonymize Your NGO’s Sensitive Client Data for Invoices
The goal is to create an internal reference that lets the organization do accounting without encoding identity. There are many ways to do this. None require special software. All are better than using initial + birthday or any other snippets of any other identifying information.
Option 1: Sequential IDs
Create a list of identifiers such as 001-200 (or A001-A200).
Print the list and keep it in a binder or secure file.
Assign the next unused ID when a client begins services.
Use only this ID on invoices and grant accounting.
Yes, it’s really that easy.
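For the script-inclined, the whole “system” fits in a few lines. This sketch assumes the A001–A200 format from above; the in-memory mapping is a stand-in for your binder:

```python
# Option 1 sketched in Python: a pre-generated list of sequential IDs.
# The "A" prefix and the 1-200 range follow the example above; adjust as needed.
ids = [f"A{n:03d}" for n in range(1, 201)]   # A001 ... A200

assigned = {}  # internal mapping: client -> ID (keep this secure, like the binder)

def assign_next_id(client_name: str) -> str:
    """Hand out the next unused ID when a client begins services."""
    next_id = ids[len(assigned)]
    assigned[client_name] = next_id
    return next_id

print(assign_next_id("first client"))   # A001
print(assign_next_id("second client"))  # A002
```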
This is not random, but guess what? It doesn’t matter! The purpose is not cryptographic anonymity in the technical sense. The purpose is functional de-identification: making records unlinkable (or at least not re-linkable) in practice.
Sequential IDs are super easy and decrease the likelihood of re-identification a lot compared to the very, very bad initial + birthday system. But then you still can’t tell clients you anonymize their data, because you still don’t.
Option 2: Spreadsheet-generated IDs
Please do not kick me out of nerd club for recommending people use Excel. People use Excel. A lot. And it’s fine.
Use Excel or Google Sheets to generate short alphanumeric IDs (e.g., A7F3Q9).
These tools use pseudorandom number generators, not cryptographically secure randomness. That’s an obscure technical detail that doesn’t matter practically in this context. All we want is to prevent casual re-identification and internal inference.
Generate a batch once, paste values (to freeze them), print or store securely.
Option 1 is better IMHO because it leaves less room for some weird technical mishap with Excel.
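If you would rather script the batch than fight a spreadsheet, the same idea looks like this in Python. It uses an ordinary pseudorandom generator, which is all this threat model needs; the alphabet and ID length are my choices, not a standard:

```python
# Option 2 sketched in Python instead of a spreadsheet: a plain (non-cryptographic)
# pseudorandom generator producing short alphanumeric IDs like A7F3Q9.
import random

# Skips easily confused characters (I, L, O, 0, 1) to keep IDs easy to transcribe.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"

def make_id(length: int = 6) -> str:
    return "".join(random.choice(ALPHABET) for _ in range(length))

# Generate the batch once, then "paste values": store it, never regenerate it.
batch = [make_id() for _ in range(200)]
print(batch[0])  # e.g. "R7WQ2M"
```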
Option 3: Browser-generated IDs
Browsers are amazing. I know because I have roughly 20 of them open, each with about 5-40 tabs, at any given time. The tab colonies, where I like to spend my holidays. But I digress.
Modern browsers can generate random values using built-in system entropy.
This is technically stronger than Excel, though unnecessary for the relevant threat model.
Use this if it makes your IT person feel calmer. In the absence of a nation-state-level attacker, the benefit over Option 2 is marginal. Use Option 1 if you don’t have an IT person. It’s just fine.
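Outside a browser, the same system-entropy route exists in Python’s standard `secrets` module, which is roughly what the browser’s crypto API gives you. The ID format below is illustrative:

```python
# Rough Python analogue of browser crypto.getRandomValues: the `secrets` module
# draws from the operating system's entropy pool (cryptographically strong).
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits

def make_secure_id(length: int = 6) -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(make_secure_id())  # e.g. "K3TQ9Z" -- strong randomness, and overkill here
```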
Option 4: Physical randomization
If resources are so minimal that you cannot print or write out a sheet with 200 sequential IDs, or you just like to do things the analogue way…
Assign IDs by:
rolling dice,
drawing shuffled index cards,
drawing sealed envelopes with numbers inside.
Record only the resulting number.
This is still vastly safer than birthdays or initials.
Just remember not to panic when the result does not look, sound, or feel random, because randomness often doesn’t. Recognizing randomness is hard.
But Clients Forget IDs
Look, this is all very fine and well, but our clients are:
traumatized
housing insecure
generally unlikely to remember random numbers, like most human beings
If there were a legal limit on Post-Its, I would have been arrested a long time ago. I am not saying anyone should rely on subjects’ or clients’ memories to encode random IDs from session to session, or test to test, or survey to survey, or whatever. Of course remembering identifiers should never be the client’s job.
This is an operational memory problem, not a privacy one.
The solution is not to use initial + birthday (which increases identifiability), but to keep identifiers internally and decouple them from the person everywhere else. Then the organization can destroy the linking data later, when enough time has passed that it won’t be needed anymore. Researchers do this all the time.
In other words, the NGO assigns a random ID, stores the mapping internally, uses the ID for invoices, and then destroys the mapping after the client relationship times out.
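A minimal sketch of that lifecycle, with an illustrative file name and no claims about what your retention policy should be:

```python
# Lifecycle sketch: assign an ID, keep the mapping in ONE internal file,
# put only the ID on invoices, and delete the mapping when it times out.
# The file name and storage location are illustrative, not a recommendation.
import csv
import os
import secrets
import string

MAPPING_FILE = "client_id_mapping.csv"  # keep on a secure internal drive

def assign_id(client_name: str) -> str:
    """Generate an ID and record the person-to-ID link in the internal file."""
    client_id = "".join(secrets.choice(string.ascii_uppercase + string.digits)
                        for _ in range(6))
    with open(MAPPING_FILE, "a", newline="") as f:
        csv.writer(f).writerow([client_name, client_id])
    return client_id  # only this value ever appears on invoices

def destroy_mapping() -> None:
    """After the retention period, the link between person and ID is gone."""
    if os.path.exists(MAPPING_FILE):
        os.remove(MAPPING_FILE)
```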
But wait, a dedicated attacker could break into the NGO’s office and steal the data! Sure, they could. But the moment they used the data, it would be obvious that it was stolen, and they would lose more in reputational cost than they would gain in legal benefit.
Why This Matters Beyond One NGO
Misunderstandings about data privacy and security are widespread.
In contexts involving sensitive data — such as domestic and sexual abuse victim support, drug abuse rehabilitation, and STD testing — anonymization mistakes compound the risk that intended help will cause harm.
Privacy is not a bureaucratic add-on. It is part of security.
This is not administrivia. It is part of the job.
Please…
If your invoicing does not require identifying information, stop collecting it.
If you tell people you’re anonymizing their data, actually anonymize it.
(Quietly now, to myself: If you get angry when people don’t know something you know, get a grip and explain it.)

