Modern Database Architecture: Tiered Storage [2026]
Bottom Line
Tiered storage works when local NVMe is treated as a disposable acceleration layer and object storage is treated as the durable source of truth. The hard part is controlling compaction, cache admission, and recovery so tail latency stays predictable under load.
Key Takeaways
- Amazon S3 now provides strong consistency for reads and listings after successful object writes.
- S3 Express One Zone targets single-digit millisecond access and can be up to 10x faster than S3 Standard.
- The best designs keep memory and NVMe for the working set, while S3 holds the durable baseline.
- Tiering fails when teams ignore compaction debt, cache hit ratio, and object-store request fan-out.
Modern databases are no longer built around a single storage device profile. The winning pattern in 2026 is bluntly pragmatic: keep the working set close to CPU on memory and NVMe, push durable bulk state to object storage, and let software absorb the latency gap. That sounds simple until write amplification, cache churn, and recovery storms show up. Tiered storage is not a procurement trick. It is a database architecture decision that changes the shape of your read path, control plane, and cost model.
- Amazon S3 now gives strong consistency for reads and listings after successful writes, which removes one historical objection to object-backed data paths.
- S3 Express One Zone is designed for single-digit millisecond first-byte latency and can be up to 10x faster than S3 Standard.
- Cloud-native engines increasingly use object storage as the authoritative layer, with memory and disk acting as acceleration tiers.
- The critical metrics are p95 read latency, cache hit ratio, compaction debt, and object-store request fan-out per query.
The Lead
Bottom Line
Treat NVMe as a fast, replaceable cache and object storage as durable truth. Once that boundary is explicit, database design gets cleaner, scaling gets cheaper, and performance work moves to compaction, caching, and admission control.
The old layout for databases was easy to reason about: compute and durable storage lived on the same box, so locality was strong and failure domains were small. The downside was equally obvious. Capacity scaling dragged compute with it, hardware refreshes were disruptive, and recovery meant restoring or re-replicating large local volumes. Object storage changed the tradeoff. Once services like Amazon S3 offered strong consistency after successful writes and deletes, a major systems objection softened: the durability tier no longer had to be wrapped in as many application-level consistency workarounds.
That does not mean databases can simply point their buffer pool at a bucket and declare victory. Object storage is still a different medium with different economics and latency behavior. Tiered systems succeed because they accept that mismatch and shape around it.
- Hot data stays in DRAM for indexes, memtables, vectorized execution buffers, and the most frequently reused pages.
- Warm data sits on local NVMe or attached SSD, absorbing spill, block caching, and compaction I/O.
- Cold durable data lands in S3 or an equivalent object store, where elasticity and durability dominate raw latency.
- Metadata and coordination remain separate concerns, because page placement, snapshots, and garbage collection still need an authoritative control plane.
The practical question is no longer whether object-backed storage can work. Systems such as Snowflake, ClickHouse Cloud, and RisingWave have already normalized decoupled compute and storage. The real question is which parts of your engine can tolerate indirection, and which parts still need tight local feedback loops.
Architecture & Implementation
Start with the data path, not the storage bill
A sound tiered design follows the read and write paths from CPU outward. Writes typically land in memory first, then flush to immutable files or segments, then compact into larger sorted layouts. Reads typically walk metadata, check memory caches, fall through to disk cache, and finally issue remote fetches. This is why LSM-tree engines adapt naturally to tiering: they already assume immutable runs, background compaction, and multi-level read paths.
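The read path described above can be sketched in a few lines. This is a minimal illustration, not any engine's actual implementation: `TieredReader`, `fetch_remote`, and the dictionary-backed tiers are hypothetical names, and the sketch fills both local tiers on every remote read purely for brevity, leaving admission policy out of scope.

```python
from typing import Callable, Dict


class TieredReader:
    """Read-through lookup: memory first, then local disk cache, then remote."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self.memory: Dict[str, bytes] = {}  # stands in for the DRAM block cache
        self.disk: Dict[str, bytes] = {}    # stands in for the local NVMe cache
        self.fetch_remote = fetch_remote    # hypothetical object-store GET
        self.hits = {"memory": 0, "disk": 0, "remote": 0}

    def read(self, block_id: str) -> bytes:
        # Tier 1: memory.
        if block_id in self.memory:
            self.hits["memory"] += 1
            return self.memory[block_id]
        # Tier 2: local disk cache; promote the block into memory on a hit.
        if block_id in self.disk:
            self.hits["disk"] += 1
            data = self.disk[block_id]
            self.memory[block_id] = data
            return data
        # Tier 3: remote fetch; unconditional fill is a simplification.
        self.hits["remote"] += 1
        data = self.fetch_remote(block_id)
        self.disk[block_id] = data
        self.memory[block_id] = data
        return data


# Usage: a plain dict plays the role of the object store.
store = {"seg-1/blk-0": b"payload"}
reader = TieredReader(store.__getitem__)
reader.read("seg-1/blk-0")  # falls through to the remote tier
reader.read("seg-1/blk-0")  # served from memory
```

The per-tier hit counters are the useful part: they are the raw material for the cache hit ratio and remote fan-out figures that decide whether a tiered design is actually working.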
RisingWave is a useful reference pattern. Its architecture stores persistent state in object storage, uses an LSM-tree layout organized from L0 to L6, and treats memory plus disk cache as acceleration layers above the object store. That is the right mental model for many modern systems: the durable copy is remote, while local media optimize the working set.
What the tiers are actually responsible for
- DRAM: block cache, row-group metadata, bloom filters, execution buffers, and the smallest repeatedly accessed structures.
- NVMe: read-through cache, spill space for sorts and joins, checkpoint staging, and compaction scratch space.
- S3: immutable data files, snapshots, checkpoint artifacts, and long-lived history.
- Metadata service: manifests, object maps, retention rules, and object garbage collection.
The main implementation mistake is letting those roles blur. If the NVMe tier becomes semantically authoritative, failover and autoscaling get painful. New nodes must rebuild hidden local state before they can serve traffic, and recovery now depends on a resource that was supposed to be disposable. The stronger pattern is to keep local tiers reconstructible from object storage plus metadata.
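One way to keep the local tier reconstructible is to treat the manifest as the only authority on what a node may serve from disk. The sketch below assumes a hypothetical manifest that maps each segment key to a SHA-256 hex digest; `reconcile_local_cache` is an illustrative name, not an API from any specific system.

```python
import hashlib
from typing import Dict


def reconcile_local_cache(
    manifest: Dict[str, str], local: Dict[str, bytes]
) -> Dict[str, bytes]:
    """Keep only local segments whose SHA-256 digest matches the manifest.

    Segments that are unknown to the manifest or fail the checksum are
    dropped; they can be refetched from object storage on demand.
    """
    kept: Dict[str, bytes] = {}
    for key, data in local.items():
        expected = manifest.get(key)
        if expected is not None and hashlib.sha256(data).hexdigest() == expected:
            kept[key] = data
    return kept
```

Because nothing on local disk is trusted unless the manifest vouches for it, a node with an empty or stale disk is still a valid node. It is only a slower one until its cache refills.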
Control-plane rules matter as much as media choice
Once the durable tier is remote, the database needs explicit rules for cache admission, eviction, promotion, and compaction. If every remote read is blindly copied into NVMe, the cache becomes a write-amplification machine. If compaction ignores cache locality, the engine will keep invalidating its own warm blocks.
- Promote only blocks with repeated reuse, not every cold scan result.
- Prefer immutable file segments as cache units; they are easier to checksum, evict, and reconstruct.
- Separate user-facing latency from background maintenance by rate-limiting compaction and garbage collection.
- Track object-store request fan-out per query, because small fragmented reads often matter more than raw bytes transferred.
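The "promote only on repeated reuse" rule can be approximated with a small ghost list that remembers recently seen block IDs without storing their data. This is a sketch under stated assumptions: `ReuseAdmission`, the ghost capacity, and the second-sighting threshold are illustrative choices, and the 16 MB size cap is likewise illustrative.

```python
from collections import OrderedDict


class ReuseAdmission:
    """Admit a block to the NVMe cache only on its second recent sighting."""

    def __init__(self, ghost_capacity: int = 1024,
                 max_bytes: int = 16 * 1024 * 1024) -> None:
        self.ghost: "OrderedDict[str, None]" = OrderedDict()  # IDs only, no data
        self.ghost_capacity = ghost_capacity
        self.max_bytes = max_bytes

    def should_admit(self, block_id: str, size: int) -> bool:
        if size > self.max_bytes:
            return False  # oversized blocks never enter the warm tier
        if block_id in self.ghost:
            del self.ghost[block_id]  # seen before: reuse confirmed
            return True
        self.ghost[block_id] = None  # first sighting: remember the ID only
        if len(self.ghost) > self.ghost_capacity:
            self.ghost.popitem(last=False)  # forget the oldest sighting
        return False


policy = ReuseAdmission(ghost_capacity=2)
policy.should_admit("blk-a", 4096)  # False: first sighting
policy.should_admit("blk-a", 4096)  # True: repeated read
```

A one-pass cold scan touches each block once, so it only churns the ghost list and never writes to NVMe, which is the behavior the first rule above asks for.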
# Illustrative tiering policy
hot:
  admission: repeated_read or index_block
  medium: dram
warm:
  admission: remote_block_reuse and size_lte_16mb
  medium: nvme
cold:
  admission: checkpoint, snapshot, immutable_segment
  medium: s3
background:
  compaction: rate_limited
  eviction: lru_plus_cost
  recovery: rebuild_local_from_manifest
If you publish or share policy snippets like this internally, the Code Formatter is a simple way to keep YAML and config examples readable during reviews.
Benchmarks & Metrics
Tiered storage is where lazy benchmarking gets exposed. Teams often run a single synthetic throughput test, see acceptable median latency, and miss the real failure mode: long-tail stalls during cache misses, compaction spikes, or fan-out against many tiny remote objects. Benchmarks have to measure locality transitions, not just steady-state speed.
Use a benchmark matrix, not one number
| Metric | Why it matters | Failure signal |
|---|---|---|
| p50 / p95 / p99 read latency | Shows how often queries fall through local tiers | Tail latency rises sharply after cache churn |
| Cache hit ratio | Confirms whether NVMe is serving a real working set | High read bytes but low hit rate |
| Compaction debt | Measures how much background maintenance is deferred | L0 growth and read amplification climb together |
| Remote request fan-out | Captures object-store overhead beyond transferred bytes | Many small reads dominate wall-clock time |
| Recovery time | Tests whether local state is truly disposable | Node restart requires long cache rebuilds |
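Several rows of this matrix reduce to simple arithmetic over counters most engines already expose. The sketch below assumes hypothetical inputs (a list of per-query latencies plus hit, miss, remote-request, and query counts) and uses the nearest-rank percentile definition; other percentile definitions give slightly different values.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float], cache_hits: int, cache_misses: int,
              remote_requests: int, queries: int) -> Dict[str, float]:
    """Collapse raw counters into latency, hit-ratio, and fan-out figures."""
    lookups = cache_hits + cache_misses
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "cache_hit_ratio": cache_hits / lookups if lookups else 0.0,
        "remote_fanout_per_query": remote_requests / queries if queries else 0.0,
    }
```

Reporting these figures together matters more than any one of them: a healthy p50 alongside a rising fan-out per query is usually the earliest visible sign of the tail-latency problem described above.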
Benchmark the boundaries between tiers
- Run a warm-cache pass and a cold-cache pass. The difference tells you whether the NVMe tier is critical or ornamental.
- Measure scan-heavy and point-read workloads separately. The best cache admission policy for one is often wasteful for the other.
- Include write-heavy windows. Compaction can turn an apparently cheap design into a latency outlier once flush pressure rises.
- Test failure and scale-out events. A tiered system that only works after hours of warming is not operationally elastic.
The newest wrinkle is that the object layer itself now has internal tiers. S3 Express One Zone is explicitly positioned as a higher-performance S3 storage class: it stores data in directory buckets, targets single-digit millisecond first-byte access, and supports substantially higher request rates than S3 Standard general purpose buckets. That gives architects a new middle option between local NVMe and conventional S3. It is not a substitute for memory, but it can reduce the penalty of remote truth for latency-sensitive paths that still need object semantics.
What good looks like in practice
- Working set fits in DRAM: remote reads are rare, and NVMe mostly absorbs spill and maintenance traffic.
- Working set exceeds DRAM but fits in NVMe: latency stays stable, object-store cost remains controlled, and cache hit ratio remains high.
- Working set exceeds both: the system still behaves predictably because admission, file sizing, and compaction prevent request explosion.
That last case is the real test. If your database only looks good while hot data fits locally, you do not have a tiered architecture. You have a fast cache attached to a slow surprise.
Strategic Impact
The strategic advantage of tiered storage is not just lower cost per terabyte. It is operational decoupling. Compute nodes can scale for concurrency, not for archive capacity. Recovery gets simpler because durable state is centralized. Storage classes can evolve without rewriting the entire engine. And platform teams can reason about retention, geography, and compliance at the object layer without pinning those concerns to every compute host.
- Elastic scale: compute can grow independently from durable storage, which is the foundation of modern analytic and streaming platforms.
- Faster replacement: local disks can be treated as expendable, reducing the operational drag of node failure.
- Clearer FinOps: hot capacity, warm cache, and cold retention become separate budget levers instead of one oversized cluster bill.
- Better governance: snapshots, exports, and long-lived object prefixes can be classified, retained, and audited consistently.
There is also a security angle teams routinely underestimate. Once more historical state lives in object storage, more operational artifacts become shareable and long-lived: manifests, benchmark traces, snapshots, and recovery bundles. Before passing those around outside the core team, sanitize anything sensitive with the Data Masking Tool. In practice, tiered architectures widen the surface area of metadata exposure long before they widen the surface area of raw data exposure.
Where teams still underinvest
- Observability: they graph bytes and miss request counts, queue depth, and compaction backlog.
- File sizing: too many tiny objects quietly destroy remote-read efficiency.
- Admission control: indiscriminate cache fill turns warm tiers into churn generators.
- Recovery drills: they assume local rebuild is cheap but never test restart storms or regional failover.
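The file-sizing point is easy to quantify with back-of-the-envelope arithmetic. The helper below is a deliberately crude sketch: it assumes one GET per object, no ranged or coalesced reads, and a scan that touches every byte, and `remote_requests_for_scan` is an illustrative name.

```python
def remote_requests_for_scan(scan_bytes: int, object_bytes: int) -> int:
    """Estimate GET requests for a full scan, assuming one GET per object."""
    if scan_bytes < 0 or object_bytes <= 0:
        raise ValueError("scan_bytes must be >= 0 and object_bytes > 0")
    return (scan_bytes + object_bytes - 1) // object_bytes  # integer ceiling


GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024

small = remote_requests_for_scan(1 * GIB, 256 * KIB)  # 4096 requests
large = remote_requests_for_scan(1 * GIB, 64 * MIB)   # 16 requests
```

Under these assumptions the same 1 GiB scan costs 256 times as many requests when the data sits in 256 KiB objects as when it sits in 64 MiB objects, which is why request counts belong on the dashboard next to bytes transferred.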
Road Ahead
The next phase of tiered storage is not a single breakthrough. It is a stack of smaller optimizations that make remote durability feel increasingly local. We are already seeing four directions emerge.
- Smarter object tiers: higher-performance object classes create a real warm-remote option between NVMe and deep object storage.
- More explicit cache economics: engines will make admission decisions based on reuse probability, not just recency.
- Background automation: systems like Databricks increasingly automate OPTIMIZE, VACUUM, and layout maintenance because humans are bad at tuning them continuously.
- Better failure semantics: recovery paths will be optimized as first-class workloads, not as side effects of read caching.
That last point matters most. The modern database is converging on a design where object storage is the durable substrate, local tiers are acceleration layers, and software decides what deserves proximity to CPU. NVMe still matters enormously, but less as a place to keep the only copy of important data and more as a tactical latency shield. The teams that win with tiered storage will be the ones that engineer the movement between tiers, not just the tiers themselves.
Frequently Asked Questions
Does Amazon S3 strong consistency remove the need for a database metadata layer?
When should NVMe be a cache instead of the primary database store?
How do you benchmark tiered database storage fairly?
Is S3 Express One Zone a replacement for local NVMe?