Mastering Back-of-the-Envelope Calculations for System Design

Capacity estimation, or "back-of-the-envelope" calculation, is a foundational skill in system design. It's what separates a good design from a great one. Think of it as a feasibility check that prevents costly redesigns and ensures your architecture can handle real-world traffic.

This guide will walk you through a simple framework, key formulas, and the essential numbers you need to know to confidently tackle these estimations.


Essential Numbers to Know 🧠

Before you start calculating, you need a baseline. In an interview, you're expected to know some common latency and data size figures. Memorize these!

| Operation | Typical Latency |
| --- | --- |
| Read from L1 Cache | ~0.5 ns |
| Read from L2 Cache | ~7 ns |
| Main memory read (RAM) | ~100 ns |
| Read 1 MB sequentially from memory | ~20 μs |
| Round trip within the same data center (DC) | ~0.5 ms |
| Read 1 MB sequentially from SSD | ~1 ms |
| Read 1 MB sequentially from HDD | ~20 ms |
| Round trip from US East to US West | ~40-50 ms |
| Round trip from US to Europe | ~100-150 ms |

(Figure: latency numbers every system designer should know, shown on a logarithmic scale to capture the vast differences in operation timing.)

Common Data Size Assumptions:

  • Integer/Long: 4-8 bytes

  • UUID: 16 bytes

  • Character (UTF-8): 1-4 bytes

  • Average URL: ~100 characters

  • Average text post (e.g., Tweet): ~300 bytes

  • Average image size: 200 KB - 2 MB

  • Average database record: 1-10 KB
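These reference numbers are worth keeping handy as constants. A minimal Python sketch (values taken from the tables above; the dictionary names are illustrative, and ranges are represented by a mid-range value):

```python
# Reference latencies (nanoseconds) and data sizes (bytes) from the tables above.
LATENCY_NS = {
    "l1_cache_read": 0.5,
    "l2_cache_read": 7,
    "ram_read": 100,
    "read_1mb_memory": 20_000,        # 20 us
    "dc_round_trip": 500_000,         # 0.5 ms
    "read_1mb_ssd": 1_000_000,        # 1 ms
    "read_1mb_hdd": 20_000_000,       # 20 ms
    "us_coast_to_coast": 45_000_000,  # ~40-50 ms
    "us_to_europe": 125_000_000,      # ~100-150 ms
}

SIZE_BYTES = {
    "int": 4, "long": 8, "uuid": 16,
    "avg_url": 100, "avg_text_post": 300,
    "avg_image": 500 * 1000,          # mid-range of 200 KB - 2 MB
    "avg_db_record": 5 * 1000,        # mid-range of 1-10 KB
}

# e.g., how much slower is a 1 MB sequential HDD read than the same read from memory?
ratio = LATENCY_NS["read_1mb_hdd"] / LATENCY_NS["read_1mb_memory"]
print(f"HDD is ~{ratio:.0f}x slower than memory for a 1 MB sequential read")
```

Comparisons like this are exactly the order-of-magnitude intuition interviewers look for.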


Key Metrics for Capacity Estimation

Traffic Metrics

Daily Active Users (DAU)

  • The number of unique users who use your application in a single day

  • Directly impacts server load and resource requirements

  • Formula: Total unique users accessing the system per day

Queries Per Second (QPS)

  • The number of requests your system processes every second

  • Critical for determining server capacity and load balancing needs

  • Formula: Total daily requests ÷ 86,400 seconds

Peak QPS

  • Maximum number of queries during peak hours

  • Typically 2-5x higher than average QPS

  • Formula: Average QPS × Peak multiplier (usually 2-3x)

Storage Metrics

Data Volume

  • Total amount of data to be stored

  • Includes user data, metadata, and system logs

  • Consider data growth over time (typically 5-10 years)

Storage Requirements

  • Raw storage + replication factor + backup storage

  • Formula: Base storage × (1 + replication factor + backup factor)

Performance Metrics

Response Time

  • Time taken for the system to respond to a request

  • Critical for user experience

  • Target: < 200ms for most applications

Throughput

  • Amount of data processed per unit time

  • Formula: Number of operations ÷ Time period

The 5-Step Framework for Capacity Estimation

Follow these steps to break down any capacity estimation problem into manageable chunks.

Step 1: Gather Requirements and State Your Assumptions

This is the most critical step. You must clarify the scope with your interviewer. Don't be afraid to ask questions and state your assumptions out loud.

  • Users: How many users will the system have? Daily Active Users (DAU) is a great starting point. (e.g., "Let's assume 10 million DAU.")

  • Usage Patterns: How will users interact with the system?

    • How many posts/uploads/requests per user per day?

    • What is the read-to-write ratio? (e.g., a social media feed is read-heavy, with a 100:1 read/write ratio).

  • Data: What kind of data are we storing? What's the average size of an object (image, video, text)?

  • Growth: What's the expected data and user growth rate? (e.g., "Let's plan for 5 years of growth at 20% per year.")

💡 Pro Tip: Always use round, simple numbers. It makes the math easier. Saying "10 million DAU" is better than "11,257,000 DAU."

Step 2: Estimate Traffic (Queries Per Second)

Next, convert user activity into requests per second that your servers must handle.

Formulas:

  • Average Queries Per Second (QPS):

    Average QPS = (Daily Active Users × Requests per User per Day) ÷ 86,400

  • Peak Queries Per Second (QPS): Traffic is never evenly distributed. There will be peak hours.

    Peak QPS = Average QPS × Peak Multiplier (usually 2-3x)

Example:

Let's say we have a system with 10 million DAU, and each user makes 50 requests per day.

  • Average QPS = (10,000,000 × 50) ÷ 86,400 = 500,000,000 ÷ 86,400 ≈ 5,800 QPS

  • Peak QPS = 5,800 × 2.5 = 14,500 QPS

You'll also need to break this down by reads and writes using your assumed ratio. If it's a 10:1 read/write ratio:

  • Peak Read QPS = 14,500 × (10/11) ≈ 13,180 QPS

  • Peak Write QPS = 14,500 × (1/11) ≈ 1,320 QPS
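The arithmetic above can be bundled into a small helper. This is a sketch using this example's assumptions (a 2.5x peak multiplier and a 10:1 read/write ratio); the function name and defaults are illustrative:

```python
SECONDS_PER_DAY = 86_400

def estimate_qps(dau: int, requests_per_user: int,
                 peak_multiplier: float = 2.5, read_write_ratio: int = 10):
    """Return (average, peak, peak_read, peak_write) QPS estimates."""
    average = dau * requests_per_user / SECONDS_PER_DAY
    peak = average * peak_multiplier
    # Split peak traffic using the assumed read:write ratio (e.g., 10:1
    # means 10/11 of requests are reads and 1/11 are writes).
    peak_read = peak * read_write_ratio / (read_write_ratio + 1)
    peak_write = peak / (read_write_ratio + 1)
    return average, peak, peak_read, peak_write

avg, peak, reads, writes = estimate_qps(10_000_000, 50)
print(f"avg ~{avg:,.0f} QPS, peak ~{peak:,.0f} QPS "
      f"({reads:,.0f} reads / {writes:,.0f} writes)")
```

The exact output (~5,787 average QPS) differs slightly from the rounded figures above; in an interview, rounding to 5,800 is the right move.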

Step 3: Estimate Storage and Bandwidth

Now, figure out how much data you'll need to store and how much data will be moving through your network.

Storage Formulas:

  • Data per Day:

    Daily Storage = (Number of Writes per Day) × (Average Object Size)

  • Total Storage (for N years):

    Total Storage = (Daily Storage × 365 × Years) × Replication Factor

    Don't forget a replication factor! Data is usually replicated 3 times for durability.

Bandwidth Formulas:

  • Egress (Read) Bandwidth:

    Read Bandwidth = Read QPS × Average Response Size

  • Ingress (Write) Bandwidth:

    Write Bandwidth = Write QPS × Average Request Size

Example:

Let's say our 10 million users write 1 post per day, and each post is 1 KB.

  • Daily Storage = 10,000,000 × 1 KB = 10 GB

  • 5-Year Storage (with 3x replication) = (10 GB × 365 × 5) × 3 = 18,250 GB × 3 ≈ 55 TB

  • Read Bandwidth (Egress) = 13,180 QPS × 1 KB/response ≈ 13.2 MB/s

  • Write Bandwidth (Ingress) = 1,320 QPS × 1 KB/request ≈ 1.3 MB/s

A second, standalone bandwidth example, this time adding a 20% network overhead buffer:

Read requests: 50,000 QPS
Write requests: 500 QPS
Average response size: 2 KB
Average request size: 1 KB
Read bandwidth = 50,000 × 2 KB = 100 MB/s
Write bandwidth = 500 × 1 KB = 0.5 MB/s
Total with 20% overhead = (100 + 0.5) × 1.2 = 120.6 MB/s
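The storage formula above can be sketched the same way. This assumes decimal units (1 KB = 1,000 bytes) and the 3x replication factor from the example; the function name is illustrative:

```python
def estimate_storage_bytes(writes_per_day: int, object_size_bytes: int,
                           years: int = 5, replication: int = 3) -> int:
    """Total storage after `years`, including the replication factor."""
    daily_bytes = writes_per_day * object_size_bytes
    return daily_bytes * 365 * years * replication

# 10M posts/day at 1 KB each, matching the example above.
total = estimate_storage_bytes(10_000_000, 1_000)
print(f"~{total / 1e12:.0f} TB over 5 years with 3x replication")
```

This reproduces the ~55 TB figure from the worked example.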

Step 4: Estimate Memory (Cache)

Caching is crucial for performance. The 80/20 rule (or Pareto principle) is your best friend here: 20% of the data generates 80% of the traffic. This "hot" data is what you want to cache.

Cache Formula:

Cache Size = Hot Data Percentage × Total Data Size
Memory per Server = Cache Size + Application Memory + OS Overhead

Example:

Using our storage estimate of 55 TB, we want to cache the "hot" data.

  • Cache Size = 55 TB × 0.20 = 11 TB

A second, standalone example with different assumptions (10 TB of data, 25% of it hot, spread over 10 cache servers):

Total data: 10 TB
Hot data: 25%
Cache size = 10 TB × 0.25 = 2.5 TB
Per server cache (10 cache servers) = 2.5 TB ÷ 10 = 250 GB
Application memory per server = 50 GB
Total memory per server = 250 + 50 = 300 GB

So, you'd need a total of 11 TB of RAM for your cache cluster (e.g., Redis or Memcached).
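The cache-sizing steps can be sketched as a helper. The function name and the 50 GB application-memory default are illustrative; decimal units (1 TB = 1,000 GB) are assumed:

```python
def cache_plan(total_data_tb: float, hot_fraction: float,
               servers: int, app_memory_gb: float = 50):
    """Return (cluster cache size in TB, per-server memory in GB)."""
    cache_tb = total_data_tb * hot_fraction
    per_server_cache_gb = cache_tb * 1_000 / servers  # 1 TB = 1,000 GB
    return cache_tb, per_server_cache_gb + app_memory_gb

# The standalone example above: 10 TB total, 25% hot, 10 cache servers.
cache_tb, mem_gb = cache_plan(10, 0.25, servers=10)
print(f"cluster cache: {cache_tb} TB, per-server memory: {mem_gb:.0f} GB")
```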

Step 5: Estimate Servers

Finally, calculate how many servers you need to handle the load.

Server Formula:

Number of Servers = Peak QPS ÷ QPS per Server
Consider 30% headroom for failures and maintenance
Final Server Count = Base Server Count × 1.3

You need to make an assumption for "QPS per Server." A typical web server can handle 1,000 - 2,000 QPS, but this varies wildly.

Example:

Let's assume one application server can handle 1,000 QPS. Our peak QPS is 14,500.

  • Number of Servers = 14,500 ÷ 1,000 = 14.5 → round up to 15 servers

A second, standalone example with higher load:

Peak QPS: 60,000
QPS per server: 2,000
Base servers = 60,000 ÷ 2,000 = 30 servers
With redundancy (N+1) = 30 + 1 = 31 servers
For high availability (2x) = 31 × 2 = 62 servers

💡 Pro Tip: Always add a buffer for redundancy and failures. For N servers, you might provision N+1 or even 2N servers for high availability, so you might say "We need 15 servers, but I'd provision 30 across two data centers for redundancy."
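As a sketch, the rules of thumb above (30% headroom, then doubling for high availability) combine like this; the function name and defaults are illustrative:

```python
import math

def servers_needed(peak_qps: int, qps_per_server: int,
                   headroom: float = 0.3, ha_factor: int = 2) -> int:
    """Server count with headroom for failures, then scaled for HA."""
    base = math.ceil(peak_qps / qps_per_server)
    with_headroom = math.ceil(base * (1 + headroom))
    return with_headroom * ha_factor

# The example above: 14,500 peak QPS at 1,000 QPS per server.
print(servers_needed(14_500, 1_000))
```

With both buffers applied, the 15-server base grows to 40 provisioned servers, which is in line with the "provision 2N across two data centers" advice.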


Example 1 : Designing a URL Shortener

Let's apply the framework to a classic problem: TinyURL.

Step 1: Requirements & Assumptions

  • Writes: 100 million new URLs created per month.

  • Reads: 100:1 read-to-write ratio.

  • Data: URLs are max 100 characters. We'll store the original URL and the short hash.

  • Retention: Keep URLs forever. Plan for 5 years of growth.

Step 2: Traffic (QPS)

  • Write QPS:

    • 100M URLs/month ÷ (30 days × 86,400 s) ≈ 100,000,000 ÷ 2,600,000 ≈ 40 QPS

  • Read QPS:

    • 40 QPS × 100 = 4,000 QPS

  • Peak Read QPS (with 2x factor):

    • 4,000×2=8,000QPS4,000×2=8,000 QPS

Step 3: Storage & Bandwidth

  • 5-Year Storage:

    • URLs to store = 100M/month × 12 months × 5 years = 6 billion URLs

    • Storage per URL = 100 chars for the original URL + 8 chars for the hash ≈ 110 bytes. Let's round up to 500 bytes for metadata, user ID, etc.

    • Total Storage = 6B × 500 bytes = 3 TB

    • With 3x replication = 3 TB × 3 = 9 TB

  • Bandwidth:

    • Read Bandwidth = 8,000 QPS × 500 bytes = 4 MB/s (This is very low!)

Step 4: Memory (Cache)

  • Cache Size (20% of hot data):

    • 9 TB × 0.20 = 1.8 TB

Step 5: Servers

  • Application Servers (1,000 QPS per server):

    • Servers Needed = 8,000 (Peak Read QPS) ÷ 1,000 = 8 servers

    • With redundancy, let's say 10-12 servers.
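The whole URL-shortener estimate fits in a few lines of Python. Every number below is an assumption stated in the steps above; the variable names are illustrative:

```python
SECONDS_PER_MONTH = 30 * 86_400          # ~2.6M seconds

# Step 1: assumptions
urls_per_month = 100_000_000
read_write_ratio = 100
bytes_per_record = 500                   # URL + hash + metadata
years, replication = 5, 3

# Step 2: traffic (2x peak factor from the steps above)
write_qps = urls_per_month / SECONDS_PER_MONTH
peak_read_qps = write_qps * read_write_ratio * 2

# Step 3: storage, including replication (decimal units: 1 TB = 1e12 bytes)
total_urls = urls_per_month * 12 * years
storage_tb = total_urls * bytes_per_record * replication / 1e12

print(f"write ~{write_qps:.0f} QPS, peak read ~{peak_read_qps:.0f} QPS, "
      f"storage ~{storage_tb:.0f} TB")
```

Running this confirms the hand-rounded numbers: roughly 40 write QPS, roughly 8,000 peak read QPS, and 9 TB of replicated storage.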


Example 2: Social Media Platform

Requirements:

  • 500 million DAU

  • 10 posts per user per day

  • Average post size: 300 bytes

  • 20% of posts have images (200 KB average)

Calculations:

Traffic:
- Posts per day = 500M × 10 = 5B posts
- Write QPS = 5B ÷ 86,400 = 57,870 QPS
- Read QPS (assuming 50:1 ratio) = 57,870 × 50 = 2,893,500 QPS

Storage per day:
- Text data = 5B × 300 bytes = 1.5 TB
- Image data = 5B × 0.20 × 200 KB = 200 TB
- Total daily storage = 201.5 TB

5-year storage:
- 201.5 TB × 365 days × 5 years = 367,737 TB ≈ 368 PB

Bandwidth:
- Read bandwidth = 2,893,500 × 1 KB = 2.9 GB/s
- Write bandwidth = 57,870 × 2 KB = 115 MB/s
- Total bandwidth = 3.015 GB/s

Servers:
- Application servers = 2,893,500 ÷ 5,000 = 579 servers
- Database servers = 368 PB ÷ 10 TB per server = 36,800 servers
- Cache servers = 368 PB × 0.20 ÷ 1 TB per server = 73,600 servers

Advanced Estimation Techniques

Geographic Distribution

When designing global systems, consider:

  • Latency requirements: Users expect <100ms response times

  • Data locality: Store data close to users

  • Compliance: Data residency requirements

  • Disaster recovery: Multi-region redundancy

Formula for regional distribution:

Regional Traffic = Total Traffic × Regional User Percentage
Regional Storage = Total Storage × Data Locality Factor

Seasonal and Peak Load Patterns

Account for traffic variations:

  • Daily peaks: 2-3x average traffic

  • Weekly patterns: Weekend vs. weekday differences

  • Seasonal spikes: Holiday shopping, events

  • Viral effects: Sudden traffic surges

Peak planning formula:

Peak Capacity = Base Capacity × Peak Multiplier × Safety Factor
Safety Factor = 1.5 to 2.0 for critical systems
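As a quick sketch of the peak-planning formula (the 10,000 QPS base, 3x multiplier, and 1.5 safety factor are hypothetical inputs):

```python
def peak_capacity(base_qps: float, peak_multiplier: float = 3.0,
                  safety_factor: float = 1.5) -> float:
    """Capacity to provision for: base load scaled for peaks plus a safety margin."""
    return base_qps * peak_multiplier * safety_factor

# A hypothetical service averaging 10,000 QPS.
print(f"provision for ~{peak_capacity(10_000):,.0f} QPS")
```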

Cost Optimization Strategies

Resource Scaling:

  • Vertical scaling: Adding more power to existing servers

  • Horizontal scaling: Adding more servers

  • Auto-scaling: Dynamic resource allocation

Cost-Performance Trade-offs:

Cost per QPS = (Server Cost + Operational Cost) ÷ QPS Capacity
Total Cost of Ownership = Hardware + Software + Operations + Maintenance

Common Pitfalls and Best Practices

Pitfalls to Avoid

  1. Underestimating growth: Not accounting for viral growth or success

  2. Ignoring peak loads: Planning only for average traffic

  3. Overlooking redundancy: Not planning for failures

  4. Unrealistic assumptions: Using overly optimistic performance numbers

  5. Ignoring operational overhead: Not accounting for monitoring, logging, backups

Best Practices

  1. Start with requirements: Always begin with clear functional and non-functional requirements

  2. Use round numbers: Simplify calculations with approximations

  3. Document assumptions: Make your assumptions explicit and validate them

  4. Plan for failure: Include redundancy and disaster recovery

  5. Monitor and adjust: Continuously validate estimates against real-world performance

  6. Consider the full stack: Don't forget about load balancers, CDNs, and monitoring systems

Tools and Resources

Calculation Tools

  • Spreadsheet templates: For complex multi-variable calculations

  • Online calculators: AWS Calculator, Google Cloud Pricing Calculator

  • Capacity planning software: Specialized tools for enterprise environments

Performance Benchmarks

  • Database performance: 1,000-10,000 QPS per server

  • Web server capacity: 10,000-50,000 concurrent connections

  • Network latency: 1ms datacenter, 50ms cross-region, 150ms global

  • Storage IOPS: SSD 1,000-10,000, HDD 100-200

Don't Forget to Sanity-Check Your Estimates! ✅

After you've done the math, take a step back and ask if the numbers make sense. This shows senior-level thinking.

  • Order-of-Magnitude Check: If you calculate that a simple photo-sharing app needs exabytes of storage, you've likely made a mistake. Does the result feel right?

  • Dimensional Analysis: Did you mix up bits and bytes? Megabytes and Gigabytes? This is a common error that can throw your estimate off by a factor of 8x or 1000x.

  • Bottleneck Identification: Did one resource estimate come out wildly larger than the others? If you need 10,000 database servers but only 10 application servers, your data storage/access pattern might be flawed.

  • Compare to Known Systems: Compare your numbers to public data. WhatsApp handles over 100 billion messages a day. Does your messaging app estimate seem reasonable in comparison?
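For the dimensional-analysis pitfall in particular, a tiny formatter helps sanity-check magnitudes; this sketch assumes decimal units (1 KB = 1,000 bytes), and the function name is illustrative:

```python
def human_bytes(n: float) -> str:
    """Format a byte count using decimal units (1 KB = 1,000 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if n < 1000:
            return f"{n:.1f} {unit}"
        n /= 1000
    return f"{n:.1f} EB"

# The Example 2 storage figure; misreading bits as bytes would skew this by 8x.
print(human_bytes(368e15))
```

Printing intermediate results this way makes a bits-versus-bytes or MB-versus-GB slip obvious before it propagates through the whole estimate.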

Conclusion

Capacity estimation is both an art and a science that requires technical knowledge, practical experience, and sound judgment. The key is to start with clear requirements, make reasonable assumptions, and use systematic calculations to arrive at resource estimates.

Remember that these are estimates, not precise predictions. The goal is to provide a reasonable baseline for system design decisions while building in appropriate safety margins for growth and unexpected load patterns.