GDPR and Event Sourcing: Thoughts on a Hard Problem

Dec 10, 2024 · 5 min read

Event streams are append-only by design. GDPR demands deletion. Here is how I think about reconciling the two, and the approach I prefer in practice.

Event Sourcing and GDPR look, at first glance, like a contradiction in terms.

An event store is append-only by design. Every fact that ever happened is preserved, immutable, replayable. That is the whole point. It is also what makes it powerful: a complete, auditable history of your system, with the ability to rebuild any read model from scratch.

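GDPR, on the other hand, hands users a hammer called the right to be forgotten (formally the right to erasure, Article 17). They can ask you to remove their personal data and, unless one of the regulation's exceptions applies, you have to comply without undue delay. Not "kind of." Not "eventually." Actually remove it.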

So how do you square an append-only log with a legal requirement to delete?

The usual answers

There are a few patterns that come up repeatedly in event-sourced systems. Each has trade-offs.

1. Hard delete: rewrite or drop events

The simplest mental model: when a user asks to be forgotten, find every event that mentions them and delete it.

It works, but it breaks the core promise of the event store. Projections built from the stream may no longer be reproducible. Aggregates that depended on those events may refuse to rehydrate. Audit trails get holes. And if your event store relies on cryptographic chaining or strict ordering, you may invalidate the whole structure.

For some domains this is acceptable. For most it is not.
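
To make the chaining risk concrete, here is a minimal sketch in Python. It assumes a hypothetical store in which each event records the SHA-256 hash of its predecessor; `append`, `verify`, and the event fields are illustrative names, not the API of any particular event store.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first event


def event_hash(event: dict) -> str:
    """Hash the event's content together with the hash of its predecessor."""
    body = {k: event[k] for k in ("type", "data", "prev")}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append(log: list, type_: str, data: dict) -> None:
    """Append an event that points at the hash of the previous one."""
    prev = log[-1]["hash"] if log else GENESIS
    event = {"type": type_, "data": data, "prev": prev}
    event["hash"] = event_hash(event)
    log.append(event)


def verify(log: list) -> bool:
    """Return True only if every event still links to its predecessor."""
    prev = GENESIS
    for event in log:
        if event["prev"] != prev or event["hash"] != event_hash(event):
            return False
        prev = event["hash"]
    return True


log = []
append(log, "UserSignedUp", {"user": "alice@example.com"})
append(log, "OrderPlaced", {"user": "alice@example.com", "total": 42})
append(log, "OrderShipped", {"order": 1})

intact = verify(log)  # the chain is consistent before any deletion

# "Forget" alice by dropping every event that mentions her.
pruned = [e for e in log if e["data"].get("user") != "alice@example.com"]
still_valid = verify(pruned)  # the surviving event points at a hash that is gone
```

In this sketch the one surviving event still references the hash of a deleted predecessor, so verification fails for the whole log, not just for the removed entries.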

2. Crypto-shredding

A more elegant alternative: encrypt personal data inside events with a per-subject key, and store the key separately. When deletion is requested, throw away the key. The event remains, but the personal payload is now unreadable noise.

This is clean from a storage standpoint and respects immutability. The downside is operational complexity: you need a proper key management story, key rotation, backups that don't accidentally preserve old keys, and a clear understanding of what counts as "personal data" inside each event.

It also doesn't help if personal data leaked into fields you didn't think to encrypt: free-text comments, addresses copied into unrelated events, log-like payloads. Once it's plain text in the stream, shredding won't save you.
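
A rough sketch of the key lifecycle, using only the Python standard library. The cipher here is a deliberately simple toy (a SHAKE-256 keystream XORed with the payload) standing in for a real authenticated cipher from a vetted library; `key_store`, `encrypt_for`, `decrypt_for`, and `shred` are hypothetical names chosen for illustration.

```python
import hashlib
import json
import secrets
from typing import Optional

# Per-subject keys live OUTSIDE the event stream, in a deletable store.
key_store: dict[str, bytes] = {}


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # TOY cipher for illustration only: not authenticated, not for production.
    # A real system would use an authenticated cipher from a vetted library.
    return hashlib.shake_256(key + nonce).digest(length)


def encrypt_for(subject_id: str, personal: dict) -> dict:
    """Encrypt a personal payload under the subject's own key."""
    key = key_store.setdefault(subject_id, secrets.token_bytes(32))
    nonce = secrets.token_bytes(16)
    plain = json.dumps(personal).encode()
    cipher = bytes(a ^ b for a, b in zip(plain, _keystream(key, nonce, len(plain))))
    return {"nonce": nonce.hex(), "ciphertext": cipher.hex()}


def decrypt_for(subject_id: str, blob: dict) -> Optional[dict]:
    """Return the personal payload, or None once the key has been shredded."""
    key = key_store.get(subject_id)
    if key is None:
        return None
    cipher = bytes.fromhex(blob["ciphertext"])
    nonce = bytes.fromhex(blob["nonce"])
    plain = bytes(a ^ b for a, b in zip(cipher, _keystream(key, nonce, len(cipher))))
    return json.loads(plain)


def shred(subject_id: str) -> None:
    """Honor an erasure request by destroying the subject's key."""
    key_store.pop(subject_id, None)


# The event itself is immutable; only the encrypted part depends on the key.
event = {
    "type": "UserSignedUp",
    "subject_id": "subject-123",
    "personal": encrypt_for("subject-123", {"email": "alice@example.com"}),
}

before = decrypt_for("subject-123", event["personal"])  # readable while the key exists
shred("subject-123")
after = decrypt_for("subject-123", event["personal"])   # None: the key is gone
```

Note that the guarantee is only as strong as the key deletion: any backup or cache that still holds the key makes the payload readable again.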

3. External personal data, references in the stream

This is the approach I keep coming back to, and the one I think is the best default for new systems.

The idea is simple: the event stream does not contain personal data directly. Instead, events carry references to records that live in a separate, mutable store specifically designed to be deletable.

So an event like UserSignedUp does not embed the email, name, or address. It embeds a stable, opaque identifier (a random ID, not something derived from the email), and the personal data sits in a side store keyed by that identifier.

When a user invokes their right to be forgotten:

You delete the row in the personal-data store.
The event stream stays untouched.
Projections that need personal data join against the side store at query or projection-build time.
If the side data is gone, the projection naturally renders an empty or anonymized value.
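
The steps above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the in-memory `event_stream` and `personal_data` structures stand in for a real event store and a real database, and `sign_up`, `forget`, and `signups_view` are hypothetical names.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    type: str
    subject_id: str  # opaque reference, never the email or name itself
    data: dict       # non-personal facts only


event_stream: list[Event] = []       # append-only
personal_data: dict[str, dict] = {}  # mutable, deletable side store


def sign_up(email: str, name: str, plan: str) -> str:
    """Store personal data in the side store; the event carries only a reference."""
    subject_id = str(uuid.uuid4())
    personal_data[subject_id] = {"email": email, "name": name}
    event_stream.append(Event("UserSignedUp", subject_id, {"plan": plan}))
    return subject_id


def forget(subject_id: str) -> None:
    """Erasure request: delete the side-store record, leave the stream alone."""
    personal_data.pop(subject_id, None)


def signups_view() -> list[dict]:
    """Projection that joins events against the side store at build time."""
    rows = []
    for e in event_stream:
        if e.type != "UserSignedUp":
            continue
        person = personal_data.get(e.subject_id)
        rows.append({
            "name": person["name"] if person else "(forgotten)",
            "plan": e.data["plan"],
        })
    return rows


alice = sign_up("alice@example.com", "Alice", "pro")
bob = sign_up("bob@example.com", "Bob", "free")

before = signups_view()
forget(alice)
after = signups_view()  # same events, but Alice's row is now anonymized
```

The stream has the same two events before and after `forget`; only the projection's output changes.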

Why I prefer the reference approach

A few reasons make it my default:

The event store stays honest. Append-only is preserved. History is still complete in the sense that matters: what happened, when, and in what order.
Deletion is a real delete. Not a key thrown away, not a tombstone, not a flag. The personal data is gone, in one place, with one operation.
Replay still works. You can rebuild projections from the stream at any time. Personal data simply joins in (or doesn't) at projection time.
Auditing remains meaningful. You can still prove that an action occurred. You just can't reproduce who it referred to once they have been forgotten. That is usually exactly what GDPR wants.
The boundary is explicit. Engineers have to make a conscious decision to put personal data into an event. That friction is a feature, not a bug.

The cost is that you have to design for it from the start. Retrofitting references into a system that has been embedding personal data into events for years is painful. Possible, but painful.

What about projections and read models?

Projections built from the stream may have copied personal data into their own tables. Those are derived data and they must also be cleaned up.

Two patterns work well:

Rebuild on demand. After deletion in the side store, rebuild the affected projections. Since they join against the now-empty personal-data store, the personal fields naturally disappear.
Cascade deletes. Treat projection rows that contain personal data the same way you treat the source row: tied to the same subject identifier, deleted together.

In both cases, the principle is the same: personal data has a single owner, the side store, and everything else is a view that can be regenerated.
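
As one possible shape for the cascade pattern, here is a small sketch using Python's built-in `sqlite3` module and a foreign key with `ON DELETE CASCADE`. The table names `personal_data` and `mailing_list_view` are made up for the example, and a real projection store may of course be a different database entirely.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and their cascades) when this is switched on.
db.execute("PRAGMA foreign_keys = ON")

db.executescript("""
    CREATE TABLE personal_data (
        subject_id TEXT PRIMARY KEY,
        email      TEXT NOT NULL
    );
    -- A read model that copied personal data, tied to the owning row.
    CREATE TABLE mailing_list_view (
        subject_id TEXT NOT NULL
            REFERENCES personal_data(subject_id) ON DELETE CASCADE,
        email      TEXT NOT NULL,
        topic      TEXT NOT NULL
    );
""")

db.execute("INSERT INTO personal_data VALUES (?, ?)", ("subject-1", "alice@example.com"))
db.execute("INSERT INTO personal_data VALUES (?, ?)", ("subject-2", "bob@example.com"))
db.executemany(
    "INSERT INTO mailing_list_view VALUES (?, ?, ?)",
    [
        ("subject-1", "alice@example.com", "news"),
        ("subject-1", "alice@example.com", "offers"),
        ("subject-2", "bob@example.com", "news"),
    ],
)

# Erasure request: one delete against the owning table.
db.execute("DELETE FROM personal_data WHERE subject_id = ?", ("subject-1",))
db.commit()

# The projection rows for subject-1 were removed by the cascade.
remaining = db.execute(
    "SELECT subject_id, topic FROM mailing_list_view ORDER BY topic"
).fetchall()
```

A cascade like this only reaches projections that live in the same database as the side store; read models kept elsewhere would still need the rebuild approach or an explicit cleanup step.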

The trade-off nobody talks about

Whichever approach you pick, there is one trade-off you cannot avoid: immutability and deletability are in genuine tension. You can hide it, push it around, encode it cleverly, but it is always there.

What you are really choosing is where you take the hit:

Hard delete sacrifices the integrity of the log.
Crypto-shredding trades simplicity for key-management complexity.
External references cost some up-front design effort and require projections to be join-aware.

For most systems I build today, the external-references model wins. It keeps the event store doing what it is good at, keeps deletion straightforward, and keeps the legal story easy to explain.

Closing thought

GDPR is not the enemy of event sourcing. It is a useful constraint. It forces you to be explicit about what counts as personal data, where it lives, and who is responsible for removing it.

A well-designed event-sourced system treats personal data as a separate concern from the event stream itself. Events describe what happened. References point to who it happened to. And when "who" needs to disappear, only one place has to change.

That, to me, is privacy by design, not by accident.