Abstractions of Database Design

#databases

At any point in time, the state of a database can be thought of as the accumulation of all committed state changes applied in order up to the last. Depending on how a database is implemented, whether various data structures live in memory or on disk, a database may or may not be able to guarantee durability. If the power goes out before it gets a chance to write changes to the disk, those changes are lost.

It’s common for databases to write inserts and updates to a write-ahead log. If we consider the state of a distributed system, then write-ahead logging is effectively equivalent to the event sourcing architecture pattern.

Above the WAL, we commonly see an in-memory index. For databases like Google’s Bigtable or Apache’s Cassandra, this takes the form of an in-memory tree structure (AVL tree, red-black tree, etc…) called a memtable.

As writes are persisted to the write-ahead log, they are also added to the memtable. Periodically these memtables are flushed and written to disk in the form of an SSTable (Sorted Strings Table). Each SSTable holds a sorted disk representation of the data in the memtable. This means that multiple SSTables may have the same key, but reads can be performed in reverse (WAL (maybe), memtable, SSTablen, SSTablen-1, …, SSTable1 until the key is found.

Obviously this would mean that keys that don’t exist in the database perform a great deal of work for nothing. Bloom Filters are used to optimize reads of some keys that don’t exist in the database (not all keys are optimized due to the properties of Bloom Filters).

On the relational side, B-trees and B+-trees are essentially multi-level indexes into a table by a specific key. They are structured on-disk to be local for quick retrieval.

Some abstractions can be inferred here:

Durability is the product of immediate persistance, usually at the sacrifice of write throughput.
Buffering or batching writes can improve write throughput at the sacrifice of durability.
There will always be a trade-off between performance and consistency.
It is easier to immediately persist events rather than records.
Repairability is the product of immediately persisting events rather than records.
Atomicity is the product of maintaining an equivalent* order of these events across the system.
It is possible to merge and compact events only if there is an authoritative source of truth (whether that’s a single entity or with multiple entities reaching consensus).
Systems have the capacity for optimizations based their usage patterns, but often at the tradeoff of other operations.
Under the same circumstances, a series of events leading to the current application state can also be compacted. (It comes down to available storage resources and the desired granularity of snapshots.)

*I’m making the assumption here that equivalence does not mean congruence (events may appear in different orders if they are guaranteed to not modify the same state; see logical clocks, CDRTs, etc…).