How AWS S3 Achieves 1 Petabyte Per Second on Hard Disk Drives
In the world of cloud computing, Amazon Web Services (AWS) stands as a titan, and its Simple Storage Service (S3) is arguably the backbone of the modern internet. From hosting static websites and storing application assets to holding vast lakes of big data, S3 is the go-to solution for scalable, durable, and highly available object storage.
Given its cutting-edge performance, one would assume it runs on the latest and greatest Solid-State Drive (SSD) technology. However, the reality is far more surprising and architecturally fascinating: AWS S3, even today, is built predominantly on top of "slow" and "archaic" Hard Disk Drives (HDDs).
How is it possible that a service capable of handling over 150 million requests per second and delivering a peak throughput of over one petabyte per second relies on mechanical, spinning platters? The answer lies not in the hardware itself, but in the brilliant and multi-layered software architecture that Amazon's engineers have built to transform the inherent weaknesses of HDDs into a massive, scalable strength.
In this tutorial, we will embark on a deep dive into the inner workings of AWS S3. Drawing inspiration from a comprehensive analysis by Stanislav Kozlovski, we will unravel the engineering marvels that allow S3 to achieve its colossal scale.
We will explore the economic rationale behind choosing HDDs, understand the fundamental physics that govern their performance, and break down the sophisticated techniques—from massive parallelism and erasure coding to advanced load balancing—that make this counterintuitive architecture a resounding success. You will see why sometimes the oldest technology, used cleverly, can power some of the most modern systems on the planet.
Understanding the colossal scale of Amazon S3
Before we can appreciate the ingenuity of S3's architecture, we must first grasp the sheer magnitude of its operation. It's one thing to build a storage system for a single application; it's another thing entirely to build a global, multi-tenant service that supports millions of customers with diverse and demanding workloads. The numbers behind S3 are nothing short of staggering and paint a clear picture of the engineering challenge at hand.
As of recent figures, the scale of AWS S3 includes:
- Over 400 Trillion Stored Objects: This is a number so vast it's difficult to comprehend. Each object, whether it's a tiny log file or a massive video archive, needs to be stored durably and be readily accessible.
- 150 Million Requests Per Second: At its peak, the S3 system processes an incredible volume of `GET`, `PUT`, `LIST`, and `DELETE` operations every single second, serving data to applications and users around the globe without skipping a beat.
- Over 1 Petabyte Per Second of Peak Traffic: One petabyte (PB) is equivalent to 1,000 terabytes, or roughly one million gigabytes. To serve this much data every second requires an extraordinary level of throughput that far exceeds the capabilities of any single piece of hardware.
- An Infrastructure of Tens of Millions of Hard Drives: Powering this entire operation is a colossal fleet of commodity HDDs, the very technology many consider outdated for high-performance applications.
These metrics highlight that S3 is not just a storage service; it's a planet-scale distributed system. Achieving this level of performance, availability, and durability, all while keeping costs low enough to be competitive, required Amazon's engineers to rethink storage from the ground up. Their journey began with a fundamental and counterintuitive hardware choice.
The counterintuitive choice: why hard disk drives (HDDs)?
In an age where SSDs offer lightning-fast access times and superior random I/O performance, the decision to build a high-performance system like S3 on HDDs seems perplexing. HDDs are mechanical devices with spinning platters and moving read/write heads, making them inherently slower and more prone to physical failure than their solid-state counterparts. So why did AWS make this choice, and why do they stick with it?
The unbeatable economics of HDDs
The single most significant reason for choosing HDDs is cost. At the massive scale at which AWS operates, even small differences in unit price amplify into enormous financial implications. While the price of all forms of storage has plummeted over the decades, a significant gap remains between HDDs and SSDs.
As the data shows, in 2023, the cost per terabyte for an HDD was approximately $11, while the equivalent for an SSD was $26. This means SSDs are more than twice as expensive for the same amount of storage capacity. When you're purchasing and maintaining tens of millions of drives, this difference is monumental. Choosing HDDs allows AWS to offer storage at an extremely competitive price point, making it an accessible and attractive option for a broad range of customers and use cases. This economic advantage is the foundation upon which S3's business model is built.
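To put that gap in perspective, here is a quick back-of-the-envelope calculation. Only the per-terabyte prices come from the comparison above; the fleet size and per-drive capacity are illustrative assumptions, since AWS does not publish exact figures.

```python
# Back-of-the-envelope cost comparison using the ~2023 per-terabyte prices above.
# Fleet size and per-drive capacity are illustrative assumptions, not AWS figures.
HDD_COST_PER_TB = 11       # USD, approximate 2023 figure
SSD_COST_PER_TB = 26       # USD, approximate 2023 figure

DRIVES = 10_000_000        # assumed fleet size ("tens of millions")
TB_PER_DRIVE = 20          # assumed capacity of a modern high-density HDD

total_tb = DRIVES * TB_PER_DRIVE
hdd_cost = total_tb * HDD_COST_PER_TB
ssd_cost = total_tb * SSD_COST_PER_TB

print(f"Raw capacity:        {total_tb / 1_000_000:.0f} exabytes")
print(f"HDD fleet cost:      ${hdd_cost / 1e9:.1f}B")
print(f"SSD fleet cost:      ${ssd_cost / 1e9:.1f}B")
print(f"Extra cost of all-SSD: ${(ssd_cost - hdd_cost) / 1e9:.1f}B")
```

Under these assumptions, the all-SSD fleet would cost billions of dollars more for the same raw capacity, which is why the per-terabyte price gap matters so much at this scale.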
The inherent performance problem with HDDs
While the cost benefits are clear, they come with a significant trade-off: performance. The mechanical nature of HDDs introduces physical limitations that have remained largely unchanged for decades.
HDD vs. SSD physics
- Solid-State Drives (SSDs): These devices have no moving parts. They store data on flash-memory chips and read and write it electronically. Since electrical signals propagate at a significant fraction of the speed of light (roughly 50%), access times are measured in microseconds.
- Hard Disk Drives (HDDs): These devices are composed of one or more spinning magnetic platters and a read/write head mounted on a mechanical arm (the actuator). To access data, the actuator must physically move the head to the correct track, and then wait for the platter to spin to the correct sector. This mechanical movement is orders of magnitude slower than the electronic access of an SSD.
Breaking down HDD latency
Accessing a piece of data on an HDD is a multi-step mechanical process, and each step adds to the total latency:
- Seek Time: This is the time it takes for the actuator arm to move the read/write head to the correct track on the platter. For a random request, this can take up to 25 milliseconds (ms).
- Rotational Latency: Once the head is on the correct track, the drive must wait for the platter to rotate so that the desired data sector is underneath the head. On average, this requires half a rotation; in the worst case, a full rotation, which takes up to 8.3 ms on a typical 7,200 RPM drive.
- Transfer Rate: Finally, the data is read from the platter and transferred to memory. The time this takes depends on the amount of data. For a typical 0.5 MB block, this might take around 2.5 ms.
These physical constraints mean that an HDD has a very limited number of Input/Output Operations Per Second (IOPS) it can perform, typically around 120. This performance bottleneck is the core challenge that AWS had to solve.
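To see how these components add up, here is a minimal sketch in Python. The average seek time of roughly 9 ms is an assumption for illustration (the 25 ms quoted above is a worst case); the other figures come straight from the list above.

```python
# Rough latency budget for one random read on an HDD, using the figures above.
avg_seek_ms = 9.0             # assumed average seek (the 25 ms above is worst case)
avg_rotational_ms = 8.3 / 2   # on average, half of a full 8.3 ms rotation
transfer_ms = 2.5             # reading a 0.5 MB block off the platter

total_ms = avg_seek_ms + avg_rotational_ms + transfer_ms
print(f"Average random-read latency: ~{total_ms:.0f} ms")  # ~16 ms
```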
From limitation to advantage: mastering sequential access
The key to unlocking the potential of HDDs lies in understanding and exploiting their strengths while mitigating their weaknesses. The biggest weakness of an HDD is random access, which requires constant, time-consuming movement of the actuator. However, its greatest strength is sequential access.
The power of sequential vs. random access
- Random Access: This involves reading or writing data from various non-contiguous locations on the disk. Each operation forces the actuator to seek a new position, incurring the full latency penalty every time. This is what HDDs are notoriously bad at.
- Sequential Access: This involves reading or writing a continuous, unbroken stream of data. Once the head is positioned at the start of the stream, it can remain relatively stationary. The data flows off the platter as it spins naturally, allowing the drive to achieve its maximum transfer rate with minimal mechanical delay.
By designing a system that favors sequential access patterns, you can make an HDD perform remarkably well. This is precisely the principle S3's architecture is built upon.
Introducing "the log" data structure
To enforce sequential access, S3's underlying system treats its storage medium like a "Log". A log, in this context, is a simple yet powerful data structure: an append-only, ordered sequence of records.
When new data is written, it is simply appended to the end of the log. This is a purely sequential operation. The system doesn't need to find an empty spot or overwrite old data in place; it just goes to the end and writes. This pattern is a perfect match for HDD mechanics. It allows writes to occur at the drive's maximum speed, as it minimizes seek time and rotational latency. This concept is so effective that it's also the foundational data structure for other high-throughput distributed systems like Apache Kafka.
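As a concrete (and heavily simplified) illustration of the idea, the sketch below implements a tiny append-only log in Python. It is not S3's actual on-disk format; it only shows why the write path never needs a seek: every `put` lands at the current end of the file, while an in-memory index remembers where each object lives.

```python
import os

class AppendOnlyLog:
    """A minimal sketch of a log-structured store (not S3's real format):
    every write is appended to the end of one file, so the write path is
    purely sequential and never seeks."""

    def __init__(self, path: str):
        self._path = path
        self._file = open(path, "ab")
        self._offset = os.path.getsize(path)   # current end of the log
        self._index = {}                       # key -> (offset, length), in memory

    def put(self, key: str, value: bytes) -> None:
        self._file.write(value)                # sequential append
        self._file.flush()
        os.fsync(self._file.fileno())          # make the append durable
        self._index[key] = (self._offset, len(value))
        self._offset += len(value)

    def get(self, key: str) -> bytes:
        offset, length = self._index[key]
        # Reads may still land anywhere in the log; the log structure only
        # guarantees that *writes* are sequential.
        with open(self._path, "rb") as f:
            f.seek(offset)
            return f.read(length)

log = AppendOnlyLog("objects.log")
log.put("photo.jpg", b"...jpeg bytes...")
assert log.get("photo.jpg") == b"...jpeg bytes..."
```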
Consequently, for AWS S3, handling write operations is relatively straightforward. By performing them sequentially, they can take full advantage of the HDD's performance characteristics. The real challenge, however, comes with reading that data back.
Solving the read problem: the magic of parallelism and erasure coding
While writes can be streamlined into a sequential pattern, reads are often inherently random. A user might request any one of the 400 trillion objects at any time. If the system had to perform a slow, random seek on a single HDD for every read request, it would never be able to handle 150 million requests per second.
The challenge of random reads
Putting the numbers above together, a random read on a single HDD has an average latency of around 16 ms. At 0.5 MB per read, that translates to a peak random-read throughput of only about 32 megabytes per second (MB/s) per drive. To reach one petabyte per second (roughly one billion MB/s), you would need to overcome this limitation by a factor of over 30 million. The solution is not to make a single drive faster, but to use an immense number of them at the same time.
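Here are those numbers worked through explicitly, as a small sketch using the 16 ms and 0.5 MB figures from above.

```python
# The arithmetic behind the paragraph above.
avg_read_latency_s = 0.016                       # ~16 ms per random read
block_mb = 0.5                                   # size of each random read
per_drive_mb_s = block_mb / avg_read_latency_s   # ~31 MB/s (the ~32 MB/s quoted above)

target_mb_s = 1_000_000_000                      # 1 PB/s is roughly one billion MB/s
drives_needed = target_mb_s / per_drive_mb_s     # ~32 million parallel streams

print(f"Per-drive random-read throughput: ~{per_drive_mb_s:.1f} MB/s")
print(f"Parallel streams needed for 1 PB/s: ~{drives_needed / 1e6:.0f} million")
```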
Achieving massive throughput with massive parallelism
The core principle S3 uses to achieve its incredible read throughput is massive parallelism. Instead of storing a large file on a single disk, S3 splits the object into many smaller chunks and spreads those chunks across thousands of different hard drives.
When a client requests that object, S3 doesn't read from just one drive. It initiates parallel read operations to all the drives holding a chunk of that object. The data streams back from all these drives simultaneously, and the system reassembles the chunks into the original object. The total throughput is no longer limited by a single drive; it becomes the sum of the throughputs of all the drives working in parallel.
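Conceptually, the read path looks something like the following sketch. The `FakeDrive` class and the `(drive, chunk_id)` placement list are stand-ins invented for illustration; the real system resolves placement through internal metadata and fetches chunks from storage nodes over the network.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeDrive:
    """Toy stand-in for a storage node holding chunks keyed by id."""
    def __init__(self, chunks):
        self._chunks = chunks

    def read(self, chunk_id):
        return self._chunks[chunk_id]

def read_object(chunk_locations):
    """Fetch every chunk of an object in parallel and reassemble it.
    `chunk_locations` is an ordered list of (drive, chunk_id) pairs,
    a simplified stand-in for S3's internal placement metadata."""
    with ThreadPoolExecutor(max_workers=len(chunk_locations)) as pool:
        # map() preserves input order, so the chunks come back in sequence.
        chunks = pool.map(lambda loc: loc[0].read(loc[1]), chunk_locations)
        return b"".join(chunks)

# Example: an object split into three chunks on three different "drives".
drives = [FakeDrive({0: b"Hel"}), FakeDrive({0: b"lo "}), FakeDrive({0: b"S3"})]
locations = [(drives[0], 0), (drives[1], 0), (drives[2], 0)]
print(read_object(locations))  # b'Hello S3'
```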
How erasure coding enables parallelism and durability
To split data effectively and ensure its safety, S3 uses a technique called Erasure Coding. This is a method of data protection in which data is broken into fragments, expanded, and encoded with redundant data pieces (known as parity shards).
Here's how it works in S3's case:
- An object is broken into `k` data shards.
- The system then generates an additional `m` parity shards using a mathematical function (like Reed-Solomon coding).
- All `k + m` shards are then stored on different physical drives.
The magic of erasure coding is that the original object can be reconstructed from any k of the available shards. AWS S3 typically uses a 5-of-9 configuration. This means an object is split into 5 data shards, and 4 parity shards are generated. The total of 9 shards are stored across 9 different drives. To read the object, the system only needs to access any 5 of those 9 shards.
This approach provides two critical benefits:
- High Performance: A read request can be satisfied by the 5 fastest-responding drives. If some drives are slow or busy, the system can rely on others without a significant delay.
- Extreme Durability: The system can tolerate the complete failure of up to 4 drives holding shards of the object and still be able to reconstruct the original data perfectly. This is how S3 achieves its famous "eleven nines" (99.999999999%) of durability.
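To make the "reconstruct from any k shards" idea concrete, here is a deliberately simplified sketch. It uses a single XOR parity shard, so it only tolerates the loss of one shard; S3's Reed-Solomon-style 5-of-9 scheme tolerates four. The mechanics, splitting into shards, computing parity, and rebuilding a lost shard from the survivors, are the same in spirit.

```python
def encode(data: bytes, k: int):
    """Toy erasure code: split `data` into k data shards plus ONE XOR parity
    shard, so any k of the k+1 shards can rebuild the original. This is a
    simplification; real systems use Reed-Solomon-style codes with several
    parity shards (e.g. the 5-of-9 layout described above)."""
    padded_len = -(-len(data) // k) * k              # round up to a multiple of k
    data = data.ljust(padded_len, b"\0")             # toy padding (assumes no trailing NULs)
    size = padded_len // k
    shards = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = bytearray(size)
    for shard in shards:                             # parity[i] = XOR of byte i of every shard
        for i, byte in enumerate(shard):
            parity[i] ^= byte
    return shards + [bytes(parity)]

def decode(shards, k):
    """Rebuild the original when at most one of the k+1 shards is missing (None)."""
    shards = list(shards)
    missing = [i for i, s in enumerate(shards) if s is None]
    if missing:
        size = len(next(s for s in shards if s is not None))
        rebuilt = bytearray(size)
        for s in shards:
            if s is not None:
                for i, byte in enumerate(s):
                    rebuilt[i] ^= byte               # XOR of the survivors = the lost shard
        shards[missing[0]] = bytes(rebuilt)
    return b"".join(shards[:k]).rstrip(b"\0")

shards = encode(b"hello erasure coding", k=4)
shards[2] = None                                     # simulate a failed drive
print(decode(shards, k=4))                           # b'hello erasure coding'
```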
Taming the beast: how S3 solves the hot partition problem
Spreading data across millions of drives solves the throughput problem, but it introduces another complex challenge: the Hot Partition Problem. A "hot partition" (or a "hot spot") occurs when a specific disk or set of disks receives a disproportionately high amount of traffic, causing it to become overloaded. This single bottleneck can slow down the entire system and even lead to cascading failures as requests get rerouted to other, already busy disks.
S3 employs a sophisticated, three-pronged strategy to ensure that load is distributed evenly and hot spots are avoided.
Solution 1: randomized data placement with "the power of two choices"
When writing new data, the system needs to decide where to place the shards. The ideal choice would be the least-loaded disk in the entire fleet. However, querying the status of millions of disks to find the absolute best one for every write would be incredibly inefficient and create its own bottleneck.
Instead, S3 uses a simple but remarkably effective randomized algorithm based on a load-balancing principle called "The Power of Two Random Choices." The principle is that when placing an item, picking the less-loaded of two randomly chosen nodes yields exponentially better load distribution than picking a single node at random.
In practice, for each shard S3 needs to store, it does the following:
- It selects two hard drives at random from the massive fleet.
- It checks the available space or current load on both drives.
- It places the shard on the drive that is less loaded (i.e., has more free space).
This incredibly simple strategy is statistically proven to prevent hotspots from forming by distributing the load almost as effectively as if the system knew the state of every single drive, but with a tiny fraction of the overhead.
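The sketch below simulates this placement policy and compares it against purely random placement. The fleet size and shard counts are made up for illustration; the point is how much lower the maximum per-drive load ends up when the system gets to pick the better of two candidates.

```python
import random

def place_shard(drives):
    """Power-of-two-choices placement sketch: sample two random drives and put
    the shard on the less-loaded one. `drives` is just a list of load counters,
    a stand-in for real free-space or queue-depth metrics."""
    a, b = random.sample(range(len(drives)), 2)
    target = a if drives[a] <= drives[b] else b
    drives[target] += 1

def max_load(num_drives, num_shards, choices):
    """Place shards with 1 or 2 random choices and report the busiest drive."""
    drives = [0] * num_drives
    for _ in range(num_shards):
        if choices == 1:
            drives[random.randrange(num_drives)] += 1
        else:
            place_shard(drives)
    return max(drives)

random.seed(42)
print("max load, 1 choice: ", max_load(10_000, 100_000, choices=1))
print("max load, 2 choices:", max_load(10_000, 100_000, choices=2))
```

With 100,000 shards spread over 10,000 drives (an average load of 10 shards per drive), single-choice placement typically leaves some drive well above the average, while two-choice placement usually stays within a few shards of it.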
Solution 2: proactive rebalancing
Even with smart initial placement, usage patterns can change over time. To handle this, S3 employs proactive rebalancing. This is guided by a key observation: newer data is almost always "hotter" (accessed more frequently) than older data.
AWS leverages this insight in two ways:
- Routine Maintenance: S3's background processes are constantly at work. They identify old, "cold" data that is rarely accessed and migrate it to less busy, archival-tier disks. This frees up space on the high-performance "hot" disks for new data that is likely to be accessed frequently.
- Hardware Expansion: When AWS adds a new, empty server rack to its infrastructure, the system immediately begins rebalancing. It identifies the most heavily loaded existing disks and starts migrating some of their data shards to the new, empty drives. This process evenly distributes the load across the newly expanded fleet.
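As a rough illustration of the hardware-expansion case, here is a greedy sketch that plans shard migrations from the most-loaded existing drives onto newly added empty ones. It is not S3's actual rebalancing algorithm, just a minimal model of "move load from the hottest disks to the emptiest ones until things even out."

```python
import heapq

def rebalance(drive_loads, new_drives, shard_load=1):
    """Greedy rebalancing sketch (not S3's actual algorithm): repeatedly move
    one shard's worth of load from the most-loaded existing drive onto the
    least-loaded new drive, until loads approach the fleet-wide average.
    `drive_loads` maps drive id -> load; `new_drives` lists empty drive ids."""
    if not drive_loads or not new_drives:
        return []
    target = sum(drive_loads.values()) / (len(drive_loads) + len(new_drives))
    hot = [(-load, d) for d, load in drive_loads.items()]   # max-heap via negation
    cold = [(0, d) for d in new_drives]                     # min-heap of new drives
    heapq.heapify(hot)
    heapq.heapify(cold)
    moves = []
    while True:
        hot_load, hot_d = heapq.heappop(hot)
        cold_load, cold_d = heapq.heappop(cold)
        if -hot_load <= target or cold_load >= target:
            break                                           # close enough to balanced
        moves.append((hot_d, cold_d))                       # migrate one shard
        heapq.heappush(hot, (hot_load + shard_load, hot_d))
        heapq.heappush(cold, (cold_load + shard_load, cold_d))
    return moves

moves = rebalance({"d1": 8, "d2": 5, "d3": 7}, new_drives=["d4", "d5"])
print(len(moves), "shard migrations planned")   # drains the hot drives toward the average
```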
Solution 3: workload decorrelation at massive scale
The final piece of the puzzle is a statistical phenomenon that only emerges at S3's immense scale: workload decorrelation. While the workload of a single user or application can be very "bursty"—with periods of inactivity followed by sudden, intense spikes in requests—the aggregate workload of the entire S3 system is remarkably stable and predictable.
This is an application of the Law of Large Numbers. Because S3 is serving millions of independent customers, the random peaks of one customer's workload are offset by the random valleys of another's. All these uncorrelated bursts average each other out. The result is that the overall load on the S3 system is smooth and lacks the extreme peaks that would otherwise cause hot partitions. This natural stabilization allows the S3 team to provision capacity with high confidence and run their infrastructure with exceptional efficiency. It's a powerful advantage that can only be leveraged by a system operating at a planetary scale.
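You can see this effect with a few lines of simulation. Each synthetic tenant below is quiet most of the time and occasionally bursts to 100x its baseline; the numbers are invented purely for illustration. As more independent tenants are aggregated, the relative variability of the combined load shrinks roughly with the square root of the number of tenants.

```python
import random
import statistics

def bursty_tenant():
    """One tenant's request rate for a given second: usually quiet, with an
    occasional 100x burst. Purely synthetic numbers for illustration."""
    return 1000 if random.random() < 0.01 else 10

def aggregate_variability(num_tenants, samples=1000):
    """Coefficient of variation (stddev / mean) of the combined request rate."""
    totals = [sum(bursty_tenant() for _ in range(num_tenants))
              for _ in range(samples)]
    return statistics.stdev(totals) / statistics.mean(totals)

random.seed(7)
for n in (1, 10, 100, 1000):
    print(f"{n:>5} tenants -> relative variability {aggregate_variability(n):.2f}")
```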
Final thoughts
The story of AWS S3 is a masterclass in systems design. It shows that with a deep understanding of hardware fundamentals and clever software engineering, you can build one of the world's highest-performing systems on top of modest, inexpensive hardware. The decision to use HDDs, driven by economics, forced a cascade of brilliant architectural choices that ultimately defined the service's success.
By embracing the sequential nature of HDDs with a log-structured design, S3 made writing data incredibly efficient. By overcoming the read bottleneck with massive parallelism and the fault tolerance of erasure coding, it achieved exceptional throughput and durability. And finally, by taming the complexity of a million-disk system with smart randomization, proactive rebalancing, and the statistical power of workload decorrelation, it ensured stability and consistent performance at a scale few can imagine.
The next time you upload a file to S3, remember that it's being intelligently fragmented, encoded, and distributed across a vast fleet of spinning disks, not just "saved" in one place. All of this is orchestrated by one of the most sophisticated software systems ever built.