Ever tried to serve a petabyte per second from hard drives that top out at ~200 MB/s each? Sounds impossible, right? That's the exact problem large-scale object storage systems face, and the solution is a masterclass in distributed systems design.
I've been digging into published research papers on how massive object stores actually work under the hood, specifically the architectures behind systems that orchestrate tens of millions of hard drives simultaneously. If you're building anything that needs to scale storage throughput, this is worth understanding.
The Core Problem: HDDs Are Painfully Slow
Let's do some napkin math. A modern HDD gives you roughly 100-200 MB/s of sequential read throughput. If you want to serve 1 PB/s (that's 1,000,000 MB/s), you'd need somewhere around 5-10 million drives just for raw bandwidth, assuming every drive is perfectly utilized 100% of the time.
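That arithmetic is easy to sanity-check in a few lines of Python:

```python
# Back-of-the-envelope: drives needed for 1 PB/s of raw read bandwidth
TARGET_BPS = 1_000_000_000_000_000   # 1 PB/s in bytes/s
DRIVE_BPS_SLOW = 100 * 1_000_000     # 100 MB/s drive
DRIVE_BPS_FAST = 200 * 1_000_000     # 200 MB/s drive

drives_high = TARGET_BPS // DRIVE_BPS_SLOW  # slower drives mean more of them
drives_low = TARGET_BPS // DRIVE_BPS_FAST

print(f"{drives_low:,} to {drives_high:,} drives")  # 5,000,000 to 10,000,000 drives
```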
Spoiler: they're never perfectly utilized.
The real problem is threefold:
- Seek time kills random reads. HDDs have mechanical heads that physically move. Random 4K reads might give you 1-2 MB/s.
- Individual drives fail constantly. At millions of drives, you're losing hundreds per day.
- Request routing is non-trivial. You need to find which drive(s) hold a given object, fast.
So how do you build a system that turns millions of slow, unreliable disks into a fast, reliable storage layer?
Step 1: Erasure Coding Over Replication
The naive approach is triple replication: store three copies of everything. It works, but at 200% storage overhead. At exabyte scale, that's an absurd cost.
Erasure coding is the fix. The idea is similar to RAID but across a distributed cluster. You split data into k data chunks and compute m parity chunks, giving you k+m total chunks spread across different drives and failure domains.
```python
# Conceptual erasure coding example
# Reed-Solomon coding with k=8 data chunks, m=3 parity chunks
import reed_solomon  # pseudocode — real implementations use ISA-L or similar

def encode_object(data: bytes, k: int = 8, m: int = 3) -> list[bytes]:
    # Assumes data has been padded to a multiple of k bytes
    chunk_size = len(data) // k
    data_chunks = [data[i*chunk_size:(i+1)*chunk_size] for i in range(k)]
    # Generate parity chunks using Galois field arithmetic
    parity_chunks = reed_solomon.encode(data_chunks, m)
    # Total: 11 chunks, can tolerate any 3 failures
    # Storage overhead: 37.5% vs 200% for triple replication
    return data_chunks + parity_chunks
```

With an 8+3 scheme, you can lose any 3 chunks and still reconstruct the full object. The storage overhead drops to ~37.5% instead of 200%. At scale, this difference is enormous.
The tradeoff? Reads now need to fetch from multiple drives and reassemble. But this actually plays into the parallelism story.
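To make the reconstruct-from-survivors mechanic concrete without a real Reed-Solomon library, here's a runnable toy: a single-parity scheme (k data chunks plus one XOR parity chunk). It only survives a single lost chunk, unlike the 8+3 scheme above, but the recovery step works the same way in spirit:

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_chunks: list[bytes]) -> bytes:
    """One parity chunk = XOR of all data chunks (RAID-5 style)."""
    return reduce(xor_bytes, data_chunks)

def reconstruct_missing(survivors: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing data chunk from survivors + parity."""
    return reduce(xor_bytes, survivors, parity)

chunks = [b"AAAA", b"BBBB", b"CCCC"]
parity = encode(chunks)
# Pretend the drive holding chunk 1 (b"BBBB") died:
recovered = reconstruct_missing([chunks[0], chunks[2]], parity)
assert recovered == b"BBBB"
```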
Step 2: Massive Parallelism Hides Latency
Here's the counterintuitive insight: spreading data across more drives makes reads faster, not slower.
When an object is split across, say, 11 drives using erasure coding, you can read all 11 chunks in parallel. Each drive only needs to serve a fraction of the total data. Your effective throughput becomes:
```
throughput = single_drive_speed × number_of_parallel_reads
```

For a 1 GB object split across 8 data chunks, each drive only reads ~125 MB. At 200 MB/s per drive, that's about 0.6 seconds instead of 5 seconds for a sequential read from one drive.
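Plugging in the numbers (using the 200 MB/s figure from earlier):

```python
OBJECT_MB = 1000   # a 1 GB object
K = 8              # data chunks
DRIVE_MBPS = 200   # per-drive sequential throughput

parallel_s = (OBJECT_MB / K) / DRIVE_MBPS  # each drive reads only 125 MB
serial_s = OBJECT_MB / DRIVE_MBPS          # one drive reads the whole object

print(parallel_s, serial_s)  # 0.625 5.0
```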
But it gets better. Since you have m extra parity chunks, you can issue k+m reads and take the fastest k responses. This is called speculative reads or hedged requests:
```python
import asyncio

async def read_object_fast(chunk_locations: list[str], k: int, m: int):
    """Read k+m chunks in parallel, reconstruct from the first k that return."""
    tasks = [
        asyncio.create_task(fetch_chunk(loc))
        for loc in chunk_locations  # all k+m locations
    ]
    results = []
    for coro in asyncio.as_completed(tasks):
        chunk = await coro
        results.append(chunk)
        if len(results) == k:  # got enough to reconstruct
            # Cancel remaining slow reads
            for t in tasks:
                t.cancel()
            return reconstruct(results)
    raise IOError(f"only {len(results)} of the required {k} chunks returned")
```

This naturally routes around slow drives, busy drives, and even drives that are in the process of failing. The tail latency improvement is massive.
Step 3: Separating Metadata from Data
One of the most important architectural decisions is keeping metadata in a separate, fast tier. You absolutely do not want to hit an HDD just to figure out where your data lives.
The metadata layer typically handles:
- Object-to-chunk mapping (which chunks make up this object?)
- Chunk-to-drive mapping (which physical drives hold each chunk?)
- Consistency and versioning
- Bucket and access control metadata
This layer lives on SSDs or in memory and is itself a distributed database. When a request comes in, the flow looks like:
- The request hits the metadata tier first, which resolves the object key to its list of chunks
- The chunk-to-drive mapping turns that list into physical drive locations
- Only then does the data path issue parallel reads against the HDDs
The metadata tier can use something like a lightweight LSM-tree-based key-value store optimized for this workload. Some systems use a custom single-shard storage engine that's fine-tuned for the specific I/O patterns of chunk management — minimizing write amplification while keeping read latency predictable.
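Here's a minimal sketch of those two lookups. The types and identifiers are hypothetical, and a real system would back these maps with a replicated key-value store rather than in-process dicts:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkLocation:
    drive_id: str
    offset: int

# Two separate mappings, both served from the fast metadata tier
object_to_chunks: dict[str, list[str]] = {
    "bucket/photo.jpg": ["chunk-0", "chunk-1", "chunk-2"],
}
chunk_to_drive: dict[str, ChunkLocation] = {
    "chunk-0": ChunkLocation("drive-a17", 4096),
    "chunk-1": ChunkLocation("drive-b03", 0),
    "chunk-2": ChunkLocation("drive-c88", 8192),
}

def locate(object_key: str) -> list[ChunkLocation]:
    """Resolve an object key to physical chunk locations without touching an HDD."""
    return [chunk_to_drive[c] for c in object_to_chunks[object_key]]
```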
Step 4: Smart Placement and Failure Domains
You can't just throw chunks randomly across drives. You need to ensure that:
- No two chunks of the same object land on the same drive
- Chunks are spread across different racks, power domains, and availability zones
- Rebalancing after a drive failure doesn't create hotspots
```python
# Simplified placement strategy
def place_chunks(object_id: str, num_chunks: int, cluster: Cluster) -> list[Drive]:
    placements = []
    used_racks = set()
    used_power_domains = set()
    for i in range(num_chunks):
        candidates = [
            d for d in cluster.available_drives()
            if d.rack not in used_racks  # different rack
            and d.power_domain not in used_power_domains  # different power
            and d.utilization < 0.85  # not overloaded
        ]
        if not candidates:
            raise RuntimeError("not enough distinct failure domains for this object")
        # Weighted random selection favoring less-loaded drives
        drive = weighted_random(candidates, key=lambda d: 1.0 - d.utilization)
        placements.append(drive)
        used_racks.add(drive.rack)
        used_power_domains.add(drive.power_domain)
    return placements
```

This placement awareness is what keeps the system durable even when an entire rack loses power.
Step 5: Background Repair Is Continuous
At the scale of millions of drives, failures aren't events — they're a constant background hum. Drives die, bit rot happens, sectors go bad. The system needs to continuously verify and repair data without impacting live traffic.
The repair process:
- Detection: Continuous background checksumming catches corruption early
- Prioritization: Objects with fewer remaining healthy chunks get repaired first
- Throttling: Repair I/O is lower priority than customer reads/writes
- Spreading: Repair reads and writes are distributed across many drives to avoid creating new hotspots
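The prioritization rule above can be sketched with a min-heap keyed on remaining healthy chunks. This is a toy model with made-up names; a real system would track chunk health in the metadata tier:

```python
import heapq

# (healthy_chunks_remaining, object_id): fewest healthy chunks repaired first
repair_queue: list[tuple[int, str]] = []

def on_chunk_lost(object_id: str, healthy_remaining: int) -> None:
    heapq.heappush(repair_queue, (healthy_remaining, object_id))

on_chunk_lost("obj-a", 10)  # lost 1 of 11 chunks in an 8+3 scheme: low urgency
on_chunk_lost("obj-b", 8)   # lost 3 of 11: one more failure means data loss
on_chunk_lost("obj-c", 9)

# The most at-risk object comes out first
assert heapq.heappop(repair_queue) == (8, "obj-b")
```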
The key insight is that repair speed matters a lot. If a drive dies and it takes too long to rebuild its data elsewhere, the probability of a second correlated failure (same batch, same rack) increases. Fast, parallelized repair across many drives is essential.
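Illustrative numbers show why spreading the repair work wins (assuming a 20 TB failed drive, 200 MB/s per-drive throughput, and repair fanned out across 100 drives):

```python
DRIVE_TB = 20
MBPS = 200        # per-drive throughput
REPAIR_FANOUT = 100  # drives sharing the rebuild work

serial_hours = (DRIVE_TB * 1_000_000 / MBPS) / 3600  # rebuild onto one spare
parallel_hours = serial_hours / REPAIR_FANOUT        # rebuild spread across the cluster

print(round(serial_hours, 1), round(parallel_hours, 2))  # 27.8 0.28
```

Cutting the rebuild window from more than a day to minutes is exactly what shrinks the odds of a second failure landing inside it.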
What This Means for Your Architecture
You probably aren't building a storage system at this scale. But the principles apply broadly:
- Parallelize aggressively. Multiple slow things in parallel beat one fast thing. This applies to database queries, API calls, file processing — anything with latency.
- Hedge your bets. Sending redundant requests and taking the fastest response is a powerful latency optimization. Jeff Dean has been preaching this for years.
- Separate your metadata path. Keep the "find the thing" path fast and separate from the "read the thing" path. This is why databases use indexes.
- Design for continuous failure. At any meaningful scale, something is always broken. Build repair and redundancy into the steady state, not as an exceptional recovery mode.
The next time someone tells you HDDs are too slow for modern workloads, remember: it's not about the speed of one drive. It's about how many you can orchestrate in parallel.
Further Reading
If you want to go deeper, look up the published research papers on large-scale object storage architectures — specifically papers from SOSP and USENIX conferences on topics like erasure-coded storage engines, LSM-tree optimization for storage workloads, and formal verification of storage systems. The engineering that goes into making slow hardware serve fast responses is genuinely fascinating stuff.