Back-of-the-Envelope / Capacity Estimation
Mastering Back-of-the-Envelope Calculations for System Design
Capacity estimation, or "back-of-the-envelope" calculation, is a foundational skill in system design. It's what separates a good design from a great one. Think of it as a feasibility check that prevents costly redesigns and ensures your architecture can handle real-world traffic.
This guide will walk you through a simple framework, key formulas, and the essential numbers you need to know to confidently tackle these estimations.
Essential Numbers to Know 🧠
Before you start calculating, you need a baseline. In an interview, you're expected to know some common latency and data size figures. Memorize these!
Read from L1 cache: ~0.5 ns
Read from L2 cache: ~7 ns
Main memory read (RAM): ~100 ns
Read 1 MB sequentially from memory: ~20 μs
Round trip within the same data center (DC): ~0.5 ms
Read 1 MB sequentially from SSD: ~1 ms
Read 1 MB sequentially from HDD: ~20 ms
Round trip from US East to US West: ~40-50 ms
Round trip from US to Europe: ~100-150 ms

Common Data Size Assumptions:
Integer/Long: 4-8 bytes
UUID: 16 bytes
Character (UTF-8): 1-4 bytes
Average URL: ~100 characters
Average text post (e.g., Tweet): ~300 bytes
Average image size: 200 KB - 2 MB
Average database record: 1-10 KB
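If it helps to keep these reference figures in one place while practicing, here is a minimal Python sketch that encodes the rough latency and size numbers above as constants. The values (and the 8-byte integer and mid-range image size chosen below) are order-of-magnitude assumptions, not exact figures.

```python
# Rough reference numbers for back-of-the-envelope math.
# Order-of-magnitude values from the tables above; real hardware varies.

NS, US, MS = 1e-9, 1e-6, 1e-3   # nanoseconds, microseconds, milliseconds (in seconds)

LATENCY_SECONDS = {
    "l1_cache_read": 0.5 * NS,
    "l2_cache_read": 7 * NS,
    "ram_read": 100 * NS,
    "read_1mb_from_ram": 20 * US,
    "same_dc_round_trip": 0.5 * MS,
    "read_1mb_from_ssd": 1 * MS,
    "read_1mb_from_hdd": 20 * MS,
    "us_east_to_west_round_trip": 45 * MS,
    "us_to_europe_round_trip": 125 * MS,
}

SIZE_BYTES = {
    "int_or_long": 8,          # taking the larger 8-byte case
    "uuid": 16,
    "avg_url": 100,            # ~100 characters ≈ ~100 bytes
    "avg_text_post": 300,
    "avg_image": 500_000,      # somewhere in the 200 KB - 2 MB range
    "avg_db_record": 5_000,    # middle of the 1-10 KB range
}

# Example: a 1 MB sequential read from HDD is ~1,000x slower than from RAM.
ratio = LATENCY_SECONDS["read_1mb_from_hdd"] / LATENCY_SECONDS["read_1mb_from_ram"]
print(f"HDD vs RAM, 1 MB sequential read: ~{ratio:.0f}x slower")
```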
Key Metrics for Capacity Estimation
Traffic Metrics
Daily Active Users (DAU)
The number of unique users who use your application in a single day
Directly impacts server load and resource requirements
Formula: Total unique users accessing the system per day
Queries Per Second (QPS)
The number of requests your system processes every second
Critical for determining server capacity and load balancing needs
Formula: Total daily requests ÷ 86,400 seconds
Peak QPS
Maximum number of queries during peak hours
Typically 2-5x higher than average QPS
Formula: Average QPS × Peak multiplier (usually 2-3x)
Storage Metrics
Data Volume
Total amount of data to be stored
Includes user data, metadata, and system logs
Consider data growth over time (typically 5-10 years)
Storage Requirements
Raw storage multiplied by the replication factor, plus backup storage
Formula: Total storage = Base storage × Replication factor + Backup storage
Performance Metrics
Response Time
Time taken for the system to respond to a request
Critical for user experience
Target: < 200ms for most applications
Throughput
Amount of data processed per unit time
Formula: Number of operations ÷ Time period
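To make the formulas above concrete, here is a small Python sketch that wraps them in helper functions. The function names and default values (a 2.5x peak multiplier, 3x replication) are illustrative choices, not part of any standard library.

```python
# Minimal helpers for the key metrics above. Function names and the default
# peak multiplier / replication factor are illustrative choices.

SECONDS_PER_DAY = 86_400

def average_qps(dau: float, requests_per_user_per_day: float) -> float:
    """Average QPS = total daily requests ÷ 86,400 seconds."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY

def peak_qps(avg: float, peak_multiplier: float = 2.5) -> float:
    """Peak QPS = average QPS × peak multiplier (usually 2-3x)."""
    return avg * peak_multiplier

def total_storage(base_bytes: float, replication_factor: int = 3,
                  backup_bytes: float = 0.0) -> float:
    """Total storage = base storage × replication factor + backup storage."""
    return base_bytes * replication_factor + backup_bytes

def throughput(operations: float, seconds: float) -> float:
    """Throughput = number of operations ÷ time period."""
    return operations / seconds

avg = average_qps(dau=10_000_000, requests_per_user_per_day=50)
print(f"Average QPS ≈ {avg:,.0f}, peak QPS ≈ {peak_qps(avg):,.0f}")
```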
The 5-Step Framework for Capacity Estimation
Follow these steps to break down any capacity estimation problem into manageable chunks.
Step 1: Gather Requirements and State Your Assumptions
This is the most critical step. You must clarify the scope with your interviewer. Don't be afraid to ask questions and state your assumptions out loud.
Users: How many users will the system have? Daily Active Users (DAU) is a great starting point. (e.g., "Let's assume 10 million DAU.")
Usage Patterns: How will users interact with the system?
How many posts/uploads/requests per user per day?
What is the read-to-write ratio? (e.g., a social media feed is read-heavy, with a 100:1 read/write ratio).
Data: What kind of data are we storing? What's the average size of an object (image, video, text)?
Growth: What's the expected data and user growth rate? (e.g., "Let's plan for 5 years of growth at 20% per year.")
💡 Pro Tip: Always use round, simple numbers. It makes the math easier. Saying "10 million DAU" is better than "11,257,000 DAU."
Step 2: Estimate Traffic (Queries Per Second)
Next, convert user activity into requests per second that your servers must handle.
Formulas:
Average Queries Per Second (QPS):
Average QPS = (Daily Active Users × Requests per User per Day) ÷ 86,400
Peak Queries Per Second (QPS): Traffic is never evenly distributed. There will be peak hours.
Peak QPS = Average QPS × Peak Multiplier (usually 2-3x)
Example:
Let's say we have a system with 10 million DAU, and each user makes 50 requests per day.
Average QPS = (10,000,000 × 50) ÷ 86,400 ≈ 5,800 QPS
Peak QPS = 5,800 × 2.5 ≈ 14,500 QPS
You'll also need to break this down by reads and writes using your assumed ratio. If it's a 10:1 read/write ratio:
Peak Read QPS ≈ 14,500 × (10/11) ≈ 13,200
Peak Write QPS ≈ 14,500 × (1/11) ≈ 1,300
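Here is the same arithmetic as a short Python sketch, assuming the numbers above (10 million DAU, 50 requests per user per day, a 2.5x peak multiplier, and a 10:1 read/write split):

```python
# Step 2 arithmetic as code: 10M DAU, 50 requests/user/day,
# a 2.5x peak multiplier, and a 10:1 read/write ratio.

SECONDS_PER_DAY = 86_400

dau = 10_000_000
requests_per_user = 50
peak_multiplier = 2.5
read_write_ratio = 10            # 10 reads for every 1 write

avg_qps = dau * requests_per_user / SECONDS_PER_DAY   # ≈ 5,800
peak = avg_qps * peak_multiplier                       # ≈ 14,500
peak_write = peak / (read_write_ratio + 1)             # ≈ 1,300
peak_read = peak - peak_write                          # ≈ 13,200

print(f"Average QPS ≈ {avg_qps:,.0f}, peak QPS ≈ {peak:,.0f}")
print(f"Peak reads ≈ {peak_read:,.0f}, peak writes ≈ {peak_write:,.0f}")
```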
Step 3: Estimate Storage and Bandwidth
Now, figure out how much data you'll need to store and how much data will be moving through your network.
Storage Formulas:
Data per Day:
Daily Storage = Number of Writes per Day × Average Object Size
Total Storage (for N years):
Total Storage = Daily Storage × 365 × Years × Replication Factor
Don't forget a replication factor! Data is usually replicated 3 times for durability.
Bandwidth Formulas:
Egress (Read) Bandwidth:
Read Bandwidth = Read QPS × Average Response Size
Ingress (Write) Bandwidth:
Write Bandwidth = Write QPS × Average Request Size
Example:
Let's say our 10 million users write 1 post per day, and each post is 1 KB.
Daily Storage = 10,000,000 × 1 KB = 10 GB per day
5-Year Storage (with 3x replication) = 10 GB × 365 × 5 × 3 ≈ 55 TB
Read Bandwidth (Egress) = ~13,200 read QPS × 1 KB ≈ 13 MB/s
Write Bandwidth (Ingress) = ~1,300 write QPS × 1 KB ≈ 1.3 MB/s
Another worked bandwidth example, with different traffic assumptions:
Read requests: 50,000 QPS
Write requests: 500 QPS
Average response size: 2 KB
Average request size: 1 KB
Read bandwidth = 50,000 × 2 KB = 100 MB/s
Write bandwidth = 500 × 1 KB = 0.5 MB/s
Total with 20% overhead = (100 + 0.5) × 1.2 = 120.6 MB/s
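Here is a minimal Python sketch of the storage and bandwidth math above, assuming the 1 KB post size, 3x replication, and the peak read/write QPS from Step 2:

```python
# Step 3 arithmetic as code, assuming 10M writes/day of 1 KB each,
# 3x replication, and the peak read/write QPS from Step 2.

KB, MB, GB, TB = 1_000, 1_000_000, 1_000_000_000, 1_000_000_000_000

writes_per_day = 10_000_000
avg_object_size = 1 * KB
years = 5
replication = 3

daily_storage = writes_per_day * avg_object_size             # ≈ 10 GB/day
total = daily_storage * 365 * years * replication             # ≈ 55 TB

read_qps, write_qps = 13_200, 1_300                           # peak values from Step 2
read_bandwidth = read_qps * 1 * KB                             # egress, bytes/sec
write_bandwidth = write_qps * 1 * KB                           # ingress, bytes/sec

print(f"Daily storage ≈ {daily_storage / GB:.0f} GB")
print(f"{years}-year storage with {replication}x replication ≈ {total / TB:.0f} TB")
print(f"Read ≈ {read_bandwidth / MB:.1f} MB/s, write ≈ {write_bandwidth / MB:.1f} MB/s")
```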
Step 4: Estimate Memory (Cache)
Caching is crucial for performance. The 80/20 rule (or Pareto principle) is your best friend here: 20% of the data generates 80% of the traffic. This "hot" data is what you want to cache.
Cache Formula:
Cache Size = Hot Data Percentage × Total Data Size
Memory per Server = Cache Size + Application Memory + OS Overhead
Example:
Using our storage estimate of 55 TB, we want to cache the "hot" data.
Cache Size = 20% × 55 TB = 11 TB
So you'd need roughly 11 TB of RAM across your cache cluster (e.g., Redis or Memcached).
Another worked example, with different assumptions:
Total data: 10 TB
Hot data: 25%
Cache size = 10 TB × 0.25 = 2.5 TB
Per-server cache (10 cache servers) = 2.5 TB ÷ 10 = 250 GB
Application memory per server = 50 GB
Total memory per server = 250 + 50 = 300 GB
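The same cache sizing as a minimal Python sketch, using the second worked example's assumptions (10 TB of data, 25% hot, 10 cache servers, 50 GB of application memory per server):

```python
# Step 4 arithmetic as code, using the second worked example's assumptions:
# 10 TB of data, 25% hot, 10 cache servers, 50 GB of app memory per server.

GB, TB = 1_000_000_000, 1_000_000_000_000

total_data = 10 * TB
hot_fraction = 0.25
cache_servers = 10
app_memory_per_server = 50 * GB

cache_size = total_data * hot_fraction                        # 2.5 TB of "hot" data
per_server_cache = cache_size / cache_servers                 # 250 GB per cache server
memory_per_server = per_server_cache + app_memory_per_server  # 300 GB total

print(f"Cache size ≈ {cache_size / TB:.1f} TB")
print(f"Memory per server ≈ {memory_per_server / GB:.0f} GB")
```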
Step 5: Estimate Servers
Finally, calculate how many servers you need to handle the load.
Server Formula:
Number of Servers = Peak QPS ÷ QPS per Server
Consider 30% headroom for failures and maintenance
Final Server Count = Base Server Count × 1.3
You need to make an assumption for "QPS per Server." A typical web server can handle 1,000 - 2,000 QPS, but this varies wildly.
Example:
Let's assume one application server can handle 1,000 QPS. Our peak QPS is 14,500.
Number of Servers = 14,500 ÷ 1,000 = 14.5, so about 15 servers
Another worked example, with different assumptions:
Peak QPS: 60,000
QPS per server: 2,000
Base servers = 60,000 ÷ 2,000 = 30 servers
With redundancy (N+1) = 30 + 1 = 31 servers
For high availability (2x) = 31 × 2 = 62 servers
💡 Pro Tip: Always add a buffer for redundancy and failures. For N servers, you might provision N+1 or even 2N servers for high availability, so you might say "We need 15 servers, but I'd provision 30 across two data centers for redundancy."
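As a quick sketch, here is the server math above in Python, assuming 1,000 QPS per server, a 30% headroom buffer, and the 2x cross-data-center provisioning mentioned in the pro tip:

```python
# Step 5 arithmetic as code: 14,500 peak QPS, 1,000 QPS per server,
# a 30% headroom buffer, and the 2x cross-data-center option from the pro tip.

import math

peak_qps = 14_500
qps_per_server = 1_000

base_servers = math.ceil(peak_qps / qps_per_server)       # 15
with_headroom = math.ceil(base_servers * 1.3)              # 20 (30% buffer)
across_two_dcs = base_servers * 2                          # 30, as in the pro tip

print(f"Base: {base_servers}, with 30% headroom: {with_headroom}, "
      f"2x across data centers: {across_two_dcs}")
```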
Example 1 : Designing a URL Shortener
Let's apply the framework to a classic problem: TinyURL.
Step 1: Requirements & Assumptions
Writes: 100 million new URLs created per month.
Reads: 100:1 read-to-write ratio.
Data: URLs are max 100 characters. We'll store the original URL and the short hash.
Retention: Keep URLs forever. Plan for 5 years of growth.
Step 2: Traffic (QPS)
Write QPS: 100,000,000 ÷ (30 × 86,400) ≈ 40 QPS
Read QPS: 40 × 100 = 4,000 QPS
Peak Read QPS (with 2x factor): ≈ 8,000 QPS
Step 3: Storage & Bandwidth
5-Year Storage:
URLs to store = 100 million/month × 12 months × 5 years = 6 billion URLs
Storage per URL = ~100 bytes for the original URL + a few bytes for the short hash. Let's round up to ~500 bytes to leave room for metadata, user ID, timestamps, etc.
Total Storage = 6 billion × 500 bytes = 3 TB
With 3x replication = ~9 TB
Bandwidth:
Read Bandwidth = 4,000 QPS × 500 bytes ≈ 2 MB/s (This is very low!)
Step 4: Memory (Cache)
Cache Size (20% of hot data): 20% × 3 TB ≈ 600 GB
Step 5: Servers
Application Servers (1,000 QPS per server):
Servers Needed = 8,000 peak read QPS ÷ 1,000 ≈ 8 servers
With redundancy, let's say 10-12 servers.
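Pulling the whole TinyURL estimate together, here is a short Python sketch; the ~500-byte record size, 2x peak factor, and 1,000 QPS per server are the assumptions used in the walkthrough above:

```python
# End-to-end TinyURL sketch. The ~500-byte record, 2x peak factor, and
# 1,000 QPS per server are the assumptions from the walkthrough above.

writes_per_month = 100_000_000
read_write_ratio = 100
seconds_per_month = 30 * 86_400

write_qps = writes_per_month / seconds_per_month        # ≈ 40
read_qps = write_qps * read_write_ratio                 # ≈ 4,000
peak_read_qps = read_qps * 2                            # ≈ 8,000

urls_5y = writes_per_month * 12 * 5                     # 6 billion URLs
bytes_per_url = 500
storage_5y = urls_5y * bytes_per_url                    # ≈ 3 TB
replicated = storage_5y * 3                             # ≈ 9 TB
cache_size = 0.20 * storage_5y                          # ≈ 600 GB
app_servers = peak_read_qps / 1_000                     # ≈ 8

print(f"Write QPS ≈ {write_qps:.0f}, read QPS ≈ {read_qps:,.0f}, "
      f"peak read QPS ≈ {peak_read_qps:,.0f}")
print(f"5-year storage ≈ {storage_5y / 1e12:.0f} TB "
      f"(≈ {replicated / 1e12:.0f} TB with 3x replication)")
print(f"Cache ≈ {cache_size / 1e9:.0f} GB, app servers ≈ {app_servers:.0f}")
```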
Example 2: Social Media Platform
Requirements:
500 million DAU
10 posts per user per day
Average post size: 300 bytes
20% of posts have images (200 KB average)
Calculations:
Traffic:
- Posts per day = 500M × 10 = 5B posts
- Write QPS = 5B ÷ 86,400 = 57,870 QPS
- Read QPS (assuming 50:1 ratio) = 57,870 × 50 = 2,893,500 QPS
Storage per day:
- Text data = 5B × 300 bytes = 1.5 TB
- Image data = 5B × 0.20 × 200 KB = 200 TB
- Total daily storage = 201.5 TB
5-year storage:
- 201.5 TB × 365 days × 5 years = 367,737 TB ≈ 368 PB
Bandwidth:
- Read bandwidth = 2,893,500 × 1 KB = 2.9 GB/s
- Write bandwidth = 57,870 × 2 KB = 115 MB/s
- Total bandwidth = 3.015 GB/s
Servers:
- Application servers = 2,893,500 ÷ 5,000 = 579 servers
- Database servers = 368 PB ÷ 10 TB per server = 36,800 servers
- Cache servers = 368 PB × 0.20 ÷ 1 TB per server = 73,600 servers
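Here is the same social-media estimate reproduced as a Python sketch; the 1 KB response size, 2 KB request size, and per-server capacities follow the example above:

```python
# The social-media example reproduced in code. The 1 KB response and 2 KB
# request sizes and the per-server capacities follow the example above.

dau = 500_000_000
posts_per_user = 10
post_bytes = 300
image_fraction = 0.20
image_bytes = 200_000
read_ratio = 50

posts_per_day = dau * posts_per_user                          # 5B posts/day
write_qps = posts_per_day / 86_400                             # ≈ 57,870
read_qps = write_qps * read_ratio                              # ≈ 2.9M

daily_text = posts_per_day * post_bytes                        # 1.5 TB/day
daily_images = posts_per_day * image_fraction * image_bytes    # 200 TB/day
storage_5y = (daily_text + daily_images) * 365 * 5             # ≈ 368 PB

read_bw = read_qps * 1_000                                     # ≈ 2.9 GB/s egress
write_bw = write_qps * 2_000                                   # ≈ 115 MB/s ingress

app_servers = read_qps / 5_000                                 # ≈ 579
db_servers = storage_5y / 10e12                                # ≈ 36,800 (10 TB each)
cache_servers = storage_5y * 0.20 / 1e12                       # ≈ 73,600 (1 TB each)

print(f"Write QPS ≈ {write_qps:,.0f}, read QPS ≈ {read_qps:,.0f}")
print(f"5-year storage ≈ {storage_5y / 1e15:.0f} PB, "
      f"bandwidth ≈ {(read_bw + write_bw) / 1e9:.2f} GB/s")
print(f"App ≈ {app_servers:,.0f}, DB ≈ {db_servers:,.0f}, cache ≈ {cache_servers:,.0f} servers")
```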
Advanced Estimation Techniques
Geographic Distribution
When designing global systems, consider:
Latency requirements: Users expect <100ms response times
Data locality: Store data close to users
Compliance: Data residency requirements
Disaster recovery: Multi-region redundancy
Formula for regional distribution:
Regional Traffic = Total Traffic × Regional User Percentage
Regional Storage = Total Storage × Data Locality Factor
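A tiny sketch of how you might apply these regional formulas; the percentages and the locality interpretation below are made-up illustrative assumptions, not figures from the text:

```python
# A sketch of the regional formulas. Percentages and the locality
# interpretation are illustrative assumptions.

total_traffic_qps = 100_000
total_storage_tb = 500

regional_user_pct = {"us": 0.45, "europe": 0.30, "asia": 0.25}

for region, pct in regional_user_pct.items():
    regional_traffic = total_traffic_qps * pct             # Regional Traffic formula
    # Data Locality Factor: here it equals the user share (each region stores
    # only its own users' data); use 1.0 instead for a full replica per region.
    regional_storage = total_storage_tb * pct              # Regional Storage formula
    print(f"{region}: ~{regional_traffic:,.0f} QPS, ~{regional_storage:.0f} TB")
```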
Seasonal and Peak Load Patterns
Account for traffic variations:
Daily peaks: 2-3x average traffic
Weekly patterns: Weekend vs. weekday differences
Seasonal spikes: Holiday shopping, events
Viral effects: Sudden traffic surges
Peak planning formula:
Peak Capacity = Base Capacity × Peak Multiplier × Safety Factor
Safety Factor = 1.5 to 2.0 for critical systems
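For example, a hedged sketch of the peak-planning formula with illustrative numbers (20,000 QPS base capacity, a 3x daily peak, and a 1.5 safety factor):

```python
# Peak-planning formula with illustrative numbers: 20,000 QPS base capacity,
# a 3x daily peak, and a 1.5 safety factor for a critical system.

base_capacity_qps = 20_000
peak_multiplier = 3
safety_factor = 1.5          # 1.5-2.0 for critical systems

peak_capacity = base_capacity_qps * peak_multiplier * safety_factor
print(f"Provision for ≈ {peak_capacity:,.0f} QPS")        # 90,000 QPS
```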
Cost Optimization Strategies
Resource Scaling:
Vertical scaling: Adding more power to existing servers
Horizontal scaling: Adding more servers
Auto-scaling: Dynamic resource allocation
Cost-Performance Trade-offs:
Cost per QPS = (Server Cost + Operational Cost) ÷ QPS Capacity
Total Cost of Ownership = Hardware + Software + Operations + Maintenance
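A small sketch of these cost formulas with made-up monthly figures, just to show the shape of the calculation:

```python
# Cost formulas with made-up monthly figures, just to show the shape.

server_cost = 400.0          # per server per month (hardware / instance)
operational_cost = 100.0     # per server per month (power, network, on-call)
qps_capacity = 2_000         # sustained QPS one server can handle

cost_per_qps = (server_cost + operational_cost) / qps_capacity    # $0.25 per QPS/month

hardware, software, operations, maintenance = 50_000, 10_000, 20_000, 5_000
total_cost_of_ownership = hardware + software + operations + maintenance

print(f"Cost per QPS ≈ ${cost_per_qps:.2f}/month, TCO ≈ ${total_cost_of_ownership:,}")
```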
Common Pitfalls and Best Practices
Pitfalls to Avoid
Underestimating growth: Not accounting for viral growth or success
Ignoring peak loads: Planning only for average traffic
Overlooking redundancy: Not planning for failures
Unrealistic assumptions: Using overly optimistic performance numbers
Ignoring operational overhead: Not accounting for monitoring, logging, backups
Best Practices
Start with requirements: Always begin with clear functional and non-functional requirements
Use round numbers: Simplify calculations with approximations
Document assumptions: Make your assumptions explicit and validate them
Plan for failure: Include redundancy and disaster recovery
Monitor and adjust: Continuously validate estimates against real-world performance
Consider the full stack: Don't forget about load balancers, CDNs, and monitoring systems
Tools and Resources
Calculation Tools
Spreadsheet templates: For complex multi-variable calculations
Online calculators: AWS Calculator, Google Cloud Pricing Calculator
Capacity planning software: Specialized tools for enterprise environments
Performance Benchmarks
Database performance: 1,000-10,000 QPS per server
Web server capacity: 10,000-50,000 concurrent connections
Network latency: 1ms datacenter, 50ms cross-region, 150ms global
Storage IOPS: SSD ~10,000-100,000+ (higher for NVMe), HDD ~100-200
Don't Forget to Sanity-Check Your Estimates! ✅
After you've done the math, take a step back and ask if the numbers make sense. This shows senior-level thinking.
Order-of-Magnitude Check: If you calculate that a simple photo-sharing app needs exabytes of storage, you've likely made a mistake. Does the result feel right?
Dimensional Analysis: Did you mix up bits and bytes? Megabytes and Gigabytes? This is a common error that can throw your estimate off by a factor of 8x or 1000x.
Bottleneck Identification: Did one resource estimate come out wildly larger than the others? If you need 10,000 database servers but only 10 application servers, your data storage/access pattern might be flawed.
Compare to Known Systems: Compare your numbers to public data. WhatsApp handles over 100 billion messages a day. Does your messaging app estimate seem reasonable in comparison?
Conclusion
Capacity estimation is both an art and a science that requires technical knowledge, practical experience, and sound judgment. The key is to start with clear requirements, make reasonable assumptions, and use systematic calculations to arrive at resource estimates.
Remember that these are estimates, not precise predictions. The goal is to provide a reasonable baseline for system design decisions while building in appropriate safety margins for growth and unexpected load patterns.