
Thoughts on Building My Own Event Store

Aug 12, 2024 · 5 min read

Thinking through what it would take to build my own event store: storage layout, performance, indexing, and the part that scares me most, write locking.

I keep coming back to the same idea.

I want to build my own event store. Not a wrapper around Postgres. Not a plugin on top of Kafka. A real, dedicated engine that treats events as the primary thing, not as rows that happen to be append-only.

I have used enough event-sourced systems by now to know what I like and what I do not. Most existing stores are either too generic, too operationally heavy, too closed, or too expensive to be a default choice. The mental model of an event store is actually quite small. The hard parts are the boring ones: durability, ordering, indexing, replay, snapshots, observability.

Before I start, a few things I want to think through honestly.

Why an event store at all

An event store is conceptually narrow:

  • Append events.
  • Read them back, in order, by subject or stream.
  • Notify subscribers when new ones arrive.
  • Guarantee that what you wrote is what you read.

That is most of it. Everything else is variations on the theme. The attraction of building one is that the surface area is small enough to be fully understood, but the constraints are real enough to teach you something on every layer: storage, concurrency, networking, protocol design.

It is the kind of project where you cannot hide behind a framework.

Language choice

Go or Rust. I will probably go with Go, because it is what I am fastest in and the concurrency model maps cleanly to what I want to build. Rust is tempting for the discipline it forces, but a Go engine that exists beats a Rust engine that lives in my head. Either way, language is not the interesting question right now.

How will events be stored?

The model I want to start from:

  • Append-only segment files on disk. New events go into the current segment. When it reaches a size threshold, it is sealed and a new one opens.
  • Framed records. Each event is length-prefixed, with a small header (version, type, length, checksum) followed by the payload.
  • Per-event sequence numbers assigned at append time, monotonic across the whole store. This is the canonical order.
  • Per-subject (or per-stream) logical ordering derived from the global sequence plus a subject identifier.
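To make the framing idea concrete, here is a minimal sketch. The field widths, the little-endian byte order, the reserved byte, and the choice of CRC-32C are all assumptions made for illustration, not a settled format. The names `encodeFrame` and `decodeFrame` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// Assumed layout (little-endian), for illustration only:
// version(1) | reserved(1) | type(2) | length(4) | checksum(4) | payload
const headerSize = 12

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

var errCorrupt = errors.New("frame: truncated or corrupt record")

// frameChecksum covers the first 8 header bytes plus the payload, so a
// flipped bit in version, type, or length is detected as well.
func frameChecksum(hdr, payload []byte) uint32 {
	return crc32.Update(crc32.Checksum(hdr[:8], castagnoli), castagnoli, payload)
}

// encodeFrame appends one framed record to dst and returns the extended slice.
func encodeFrame(dst []byte, version uint8, eventType uint16, payload []byte) []byte {
	var hdr [headerSize]byte
	hdr[0] = version
	binary.LittleEndian.PutUint16(hdr[2:4], eventType)
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(payload)))
	binary.LittleEndian.PutUint32(hdr[8:12], frameChecksum(hdr[:], payload))
	dst = append(dst, hdr[:]...)
	return append(dst, payload...)
}

// decodeFrame reads one record from the start of buf and reports how many
// bytes it consumed, so the caller can advance to the next record.
func decodeFrame(buf []byte) (version uint8, eventType uint16, payload []byte, n int, err error) {
	if len(buf) < headerSize {
		return 0, 0, nil, 0, errCorrupt
	}
	length := binary.LittleEndian.Uint32(buf[4:8])
	if uint64(len(buf)-headerSize) < uint64(length) {
		return 0, 0, nil, 0, errCorrupt
	}
	n = headerSize + int(length)
	payload = buf[headerSize:n]
	if frameChecksum(buf, payload) != binary.LittleEndian.Uint32(buf[8:12]) {
		return 0, 0, nil, 0, errCorrupt
	}
	return buf[0], binary.LittleEndian.Uint16(buf[2:4]), payload, n, nil
}

func main() {
	buf := encodeFrame(nil, 1, 7, []byte(`{"id":"42"}`))
	v, t, p, n, err := decodeFrame(buf)
	fmt.Println(v, t, string(p), n, err)
}
```

Putting the length in the header means a reader can skip a record without parsing the payload, and the checksum gives crash recovery something to verify.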

The point of segmented append-only files is that the hot path is dead simple: write to the end, fsync according to policy, done. No in-place updates, no rewrites, no fragmentation. Recovery on crash is "scan the tail of the last segment until you find a bad checksum, truncate there."
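The recovery rule can be sketched in a few lines. This version works on an in-memory byte slice and uses a deliberately simplified header (length plus CRC-32 only); both are assumptions to keep the example short. `appendRecord` and `validPrefix` are hypothetical names. Treating a zero length as "end of data" is also an assumption: it guards against zero-filled tails, at the cost of disallowing empty payloads.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// Simplified header for this sketch: length(4) | crc32(4), little-endian.
const recHeader = 8

// appendRecord appends one length-prefixed, checksummed record to seg.
func appendRecord(seg, payload []byte) []byte {
	var hdr [recHeader]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(hdr[4:8], crc32.ChecksumIEEE(payload))
	seg = append(seg, hdr[:]...)
	return append(seg, payload...)
}

// validPrefix scans seg from the start and returns the offset just past the
// last intact record, plus the number of intact records. Everything after
// that offset is a torn or corrupt tail and can be truncated away.
func validPrefix(seg []byte) (offset int, records int) {
	for {
		rest := seg[offset:]
		if len(rest) < recHeader {
			return offset, records // not even a full header left
		}
		length := binary.LittleEndian.Uint32(rest[0:4])
		if length == 0 {
			return offset, records // assumed sentinel: zero-filled tail
		}
		if uint64(len(rest)-recHeader) < uint64(length) {
			return offset, records // payload cut short by the crash
		}
		payload := rest[recHeader : recHeader+int(length)]
		if crc32.ChecksumIEEE(payload) != binary.LittleEndian.Uint32(rest[4:8]) {
			return offset, records // bad checksum: stop here
		}
		offset += recHeader + int(length)
		records++
	}
}

func main() {
	seg := appendRecord(nil, []byte("first"))
	seg = appendRecord(seg, []byte("second"))
	seg = appendRecord(seg, []byte("third"))
	seg = seg[:len(seg)-2] // simulate a write torn by a crash

	offset, records := validPrefix(seg)
	fmt.Println("keep bytes:", offset, "intact records:", records)
}
```

Against a real segment file, the returned offset is what would be passed to `os.Truncate` before the segment is reopened for appends.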

Sealed segments are immutable. That makes them friendly to compression, snapshotting, and replication later.

How do I ensure performance?

I do not want to optimize before I have something to measure, but a few principles I want to bake in from the start:

  • Sequential I/O on the write path. Always append, never seek.
  • Group commits. Buffer writes for a few hundred microseconds, fsync once, acknowledge a batch of clients together. This is the trick that makes durable writes cheap.
  • Zero-copy reads where possible. Memory-map sealed segments and let the kernel page cache do the heavy lifting.
  • Backpressure everywhere. Slow consumers must never be able to stall the write path. Watchers get queues with bounded sizes and explicit overflow behavior.
  • Streaming over batching at the API edge. JSONL (newline-delimited JSON, also known as ndjson) over HTTP, with long-lived connections for watching. One event per line, easy to parse incrementally, debuggable with curl, no special client needed.
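One possible shape for the bounded watcher queue is sketched below. The overflow policy shown (drop the watcher and make it resubscribe from its last seen sequence) is only one option and is an assumption here; `Watcher`, `Offer`, and `C` are hypothetical names. The sketch also assumes `Offer` is only ever called from a single writer goroutine.

```go
package main

import "fmt"

// Event is a minimal stand-in for a stored event.
type Event struct {
	Seq  uint64
	Data string
}

// Watcher owns a bounded queue. The writer never blocks on it.
type Watcher struct {
	ch      chan Event
	dropped bool
}

// NewWatcher creates a watcher whose queue holds at most size events.
func NewWatcher(size int) *Watcher {
	return &Watcher{ch: make(chan Event, size)}
}

// C is the receive side handed to the consumer.
func (w *Watcher) C() <-chan Event { return w.ch }

// Offer enqueues e without blocking. If the queue is full, the watcher is
// dropped: its channel is closed and all further offers are rejected.
// Must only be called from the single writer goroutine.
func (w *Watcher) Offer(e Event) bool {
	if w.dropped {
		return false
	}
	select {
	case w.ch <- e:
		return true
	default:
		w.dropped = true
		close(w.ch) // consumer drains what is queued, then sees the close
		return false
	}
}

func main() {
	w := NewWatcher(2)
	for seq := uint64(1); seq <= 3; seq++ {
		fmt.Println("offer", seq, "accepted:", w.Offer(Event{Seq: seq}))
	}
	for e := range w.C() {
		fmt.Println("drained", e.Seq)
	}
}
```

The non-blocking `select` with a `default` branch is what keeps a slow consumer from ever holding up an append. The consumer learns it fell behind when the channel closes, and can reconnect and replay from disk.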

The boring stuff matters more than the clever stuff. A clean write path and honest fsync semantics beat any micro-optimization.
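Group commit is easier to judge with code in front of you, so here is a rough sketch. It assumes a single committer goroutine, a fixed collection window, and a batch size cap. All names (`GroupCommitter`, `syncWriter`, `countingFile`, `appendAll`) are hypothetical. `countingFile` stands in for an `*os.File`, which also has `Write` and `Sync` methods. Shutdown and partial-write handling are left out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// syncWriter is the slice of *os.File that the committer needs.
type syncWriter interface {
	Write(p []byte) (int, error)
	Sync() error
}

// countingFile is a stand-in for a real file that counts calls.
type countingFile struct{ writes, syncs int }

func (f *countingFile) Write(p []byte) (int, error) { f.writes++; return len(p), nil }
func (f *countingFile) Sync() error                 { f.syncs++; return nil }

// defaultWindow is an assumed batching window, not a measured value.
const defaultWindow = 300 * time.Microsecond

type request struct {
	data []byte
	done chan error
}

// GroupCommitter funnels appends through one goroutine that syncs per batch.
type GroupCommitter struct{ reqs chan request }

func NewGroupCommitter(w syncWriter, window time.Duration, maxBatch int) *GroupCommitter {
	g := &GroupCommitter{reqs: make(chan request)}
	go g.run(w, window, maxBatch)
	return g
}

// Append blocks until the batch containing data has been synced.
func (g *GroupCommitter) Append(data []byte) error {
	r := request{data: data, done: make(chan error, 1)}
	g.reqs <- r
	return <-r.done
}

func (g *GroupCommitter) run(w syncWriter, window time.Duration, maxBatch int) {
	for first := range g.reqs {
		batch := []request{first}
		timer := time.NewTimer(window)
	collect:
		for len(batch) < maxBatch {
			select {
			case r := <-g.reqs:
				batch = append(batch, r)
			case <-timer.C:
				break collect
			}
		}
		timer.Stop()

		var err error
		for _, r := range batch {
			if _, err = w.Write(r.data); err != nil {
				break
			}
		}
		if err == nil {
			err = w.Sync() // one sync covers the whole batch
		}
		for _, r := range batch {
			r.done <- err // acknowledge only after the sync returned
		}
	}
}

// appendAll issues all appends concurrently and returns the first error.
func appendAll(g *GroupCommitter, payloads [][]byte) error {
	var wg sync.WaitGroup
	errs := make([]error, len(payloads))
	for i, p := range payloads {
		wg.Add(1)
		go func(i int, p []byte) {
			defer wg.Done()
			errs[i] = g.Append(p)
		}(i, p)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	f := &countingFile{}
	g := NewGroupCommitter(f, defaultWindow, 64)
	payloads := make([][]byte, 16)
	for i := range payloads {
		payloads[i] = []byte("event\n")
	}
	err := appendAll(g, payloads)
	fmt.Println("writes:", f.writes, "syncs:", f.syncs, "err:", err)
}
```

How many syncs a run needs depends on scheduling, but it should usually be far fewer than the number of appends. The honest part is the ordering: no caller is acknowledged before the sync that covers its write has returned.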

How do I design the index?

This is the part where it is easy to over-engineer.

At minimum I need two things: a way to jump to an event by its global position, and a way to read all events for a given stream in order. Both can start as small lookup files written next to the segments, plus a bit of in-memory state for the recent tail so watchers do not have to hit disk.

The important property is that the index is derived data. If it gets corrupted or falls behind, I want to rebuild it by scanning the segments, not recover it from a backup. Anything fancier (secondary indexes, full querying) can stay out of the engine and live in projections.
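To illustrate "the index is derived data", here is a sketch that rebuilds both lookups from nothing but the segment bytes. The record layout is a simplified assumption (sequence, stream ID, and payload lengths, with no checksum), and `Index`, `appendRecord`, and `RebuildIndex` are hypothetical names. A real version would write the result to lookup files instead of keeping maps in memory.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Simplified layout for this sketch (little-endian, checksum omitted):
// seq(8) | streamLen(2) | payloadLen(4) | stream | payload
const hdrSize = 14

// Index holds the two lookups: by global position and by stream.
type Index struct {
	Offsets map[uint64]int64    // global sequence -> byte offset in the segment
	Streams map[string][]uint64 // stream ID -> global sequences, in append order
}

// appendRecord appends one record in the assumed layout to seg.
func appendRecord(seg []byte, seq uint64, stream string, payload []byte) []byte {
	var hdr [hdrSize]byte
	binary.LittleEndian.PutUint64(hdr[0:8], seq)
	binary.LittleEndian.PutUint16(hdr[8:10], uint16(len(stream)))
	binary.LittleEndian.PutUint32(hdr[10:14], uint32(len(payload)))
	seg = append(seg, hdr[:]...)
	seg = append(seg, stream...)
	return append(seg, payload...)
}

// RebuildIndex scans seg front to back and reconstructs the index from
// scratch. It stops at the first record that does not fit in the segment.
func RebuildIndex(seg []byte) *Index {
	idx := &Index{
		Offsets: make(map[uint64]int64),
		Streams: make(map[string][]uint64),
	}
	off := 0
	for len(seg)-off >= hdrSize {
		hdr := seg[off : off+hdrSize]
		seq := binary.LittleEndian.Uint64(hdr[0:8])
		streamLen := int(binary.LittleEndian.Uint16(hdr[8:10]))
		payloadLen := int(binary.LittleEndian.Uint32(hdr[10:14]))
		end := off + hdrSize + streamLen + payloadLen
		if end > len(seg) {
			break // incomplete tail record
		}
		stream := string(seg[off+hdrSize : off+hdrSize+streamLen])
		idx.Offsets[seq] = int64(off)
		idx.Streams[stream] = append(idx.Streams[stream], seq)
		off = end
	}
	return idx
}

func main() {
	seg := appendRecord(nil, 1, "cart-1", []byte("created"))
	seg = appendRecord(seg, 2, "user-9", []byte("registered"))
	seg = appendRecord(seg, 3, "cart-1", []byte("item-added"))

	idx := RebuildIndex(seg)
	fmt.Println("offset of seq 3:", idx.Offsets[3])
	fmt.Println("cart-1 events:", idx.Streams["cart-1"])
}
```

Because the scan visits records in append order, the per-stream lists come out already sorted by global sequence, with no separate sort step.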

Write locking is hard

Honestly, this is the part that scares me most. An event store has to give strong guarantees: a global order that does not lie, per-stream ordering that agrees with it, and optimistic concurrency so two callers cannot both append the "next" event for the same stream without one of them losing.

The naive design is a single writer that everything funnels through. Ordering becomes trivial, version checks become trivial, and the only real cost is throughput. The more interesting designs (sharded writers, fine-grained per-stream locks) all immediately raise harder questions about how the global order is assigned and what happens around fsync.

I will almost certainly start with the boring single-writer version, and only reach for something cleverer if it actually becomes the bottleneck. Correctness is not something I can bolt on later.
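The boring version is small enough to sketch. This one uses a single mutex as the simplest form of "one writer at a time" and keeps events in memory. Durability is left out entirely, and `Store`, `Append`, and `ErrVersionConflict` are hypothetical names. The expected-version convention (0 means "the stream must not exist yet") is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrVersionConflict means another caller appended to the stream first.
var ErrVersionConflict = errors.New("store: expected version does not match stream version")

// Event carries both orderings: the global one and the per-stream one.
type Event struct {
	Seq     uint64 // global, monotonic across the whole store
	Stream  string
	Version uint64 // position within the stream, starting at 1
	Payload []byte
}

// Store serializes all appends behind one lock.
type Store struct {
	mu       sync.Mutex
	nextSeq  uint64
	versions map[string]uint64 // stream -> current version
	log      []Event           // in-memory stand-in for the segment files
}

func NewStore() *Store {
	return &Store{versions: make(map[string]uint64)}
}

// Append adds an event to stream if the stream is still at expectedVersion.
// The version check and the sequence assignment happen under the same lock,
// so two callers cannot both win the "next" slot of a stream.
func (s *Store) Append(stream string, expectedVersion uint64, payload []byte) (Event, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.versions[stream] != expectedVersion {
		return Event{}, ErrVersionConflict
	}
	s.nextSeq++
	s.versions[stream]++
	e := Event{
		Seq:     s.nextSeq,
		Stream:  stream,
		Version: s.versions[stream],
		Payload: payload,
	}
	s.log = append(s.log, e)
	return e, nil
}

func main() {
	s := NewStore()

	e1, err := s.Append("cart-1", 0, []byte("created"))
	fmt.Println(e1.Seq, e1.Version, err)

	// A second caller that also believes the stream is empty loses.
	_, err = s.Append("cart-1", 0, []byte("created-again"))
	fmt.Println(errors.Is(err, ErrVersionConflict))

	e2, err := s.Append("cart-1", 1, []byte("item-added"))
	fmt.Println(e2.Seq, e2.Version, err)
}
```

A failed append consumes no global sequence number, so the global order stays gap-free and the per-stream versions always agree with it.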

What I want to feel when it works

I want to be able to:

  • Start a single binary.
  • Point an HTTP client at it.
  • Stream events in.
  • Stream events back out, in order, with a clean watch story.
  • Take a snapshot. Restore it on another machine. Get exactly what I had.
  • Read the logs and metrics and understand what the system is doing.

If that experience is good, the rest is iteration. If that experience is bad, no amount of features will save it.

That is roughly where my head is. Time to stop circling the idea and start writing code.