Mastering Back-of-the-Envelope Calculations for System Design

Capacity estimation, or "back-of-the-envelope" calculation, is a foundational skill in system design. It's what separates a good design from a great one. Think of it as a feasibility check that prevents costly redesigns and ensures your architecture can handle real-world traffic.

This guide will walk you through a simple framework, key formulas, and the essential numbers you need to know to confidently tackle these estimations.


Essential Numbers to Know 🧠

Before you start calculating, you need a baseline. In an interview, you're expected to know some common latency and data size figures. Memorize these!

| Operation | Typical Latency |
| --- | --- |
| Read from L1 Cache | ~0.5 ns |
| Read from L2 Cache | ~7 ns |
| Main memory read (RAM) | ~100 ns |
| Read 1 MB sequentially from memory | ~20 μs |
| Round trip within the same data center (DC) | ~0.5 ms |
| Read 1 MB sequentially from SSD | ~1 ms |
| Read 1 MB sequentially from HDD | ~20 ms |
| Round trip from US East to US West | ~40-50 ms |
| Round trip from US to Europe | ~100-150 ms |

(Figure: latency numbers every system designer should know, shown on a logarithmic scale to capture the vast differences in operation timing.)

Common Data Size Assumptions:

  • Integer/Long: 4-8 bytes

  • UUID: 16 bytes

  • Character (UTF-8): 1-4 bytes

  • Average URL: ~100 characters

  • Average text post (e.g., Tweet): ~300 bytes

  • Average image size: 200 KB - 2 MB

  • Average database record: 1-10 KB
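These reference numbers are worth keeping handy as constants. A minimal Python sketch (values taken from the tables above; the dictionary names are illustrative, and ranges are represented by a mid-range value):

```python
# Reference latencies (nanoseconds) and data sizes (bytes) from the tables above.
LATENCY_NS = {
    "l1_cache_read": 0.5,
    "l2_cache_read": 7,
    "ram_read": 100,
    "read_1mb_memory": 20_000,        # 20 us
    "dc_round_trip": 500_000,         # 0.5 ms
    "read_1mb_ssd": 1_000_000,        # 1 ms
    "read_1mb_hdd": 20_000_000,       # 20 ms
    "us_coast_to_coast": 45_000_000,  # ~40-50 ms
    "us_to_europe": 125_000_000,      # ~100-150 ms
}

SIZE_BYTES = {
    "int": 4, "long": 8, "uuid": 16,
    "avg_url": 100, "avg_text_post": 300,
    "avg_image": 500 * 1000,          # mid-range of 200 KB - 2 MB
    "avg_db_record": 5 * 1000,        # mid-range of 1-10 KB
}

# e.g., how much slower is a 1 MB sequential HDD read than the same read from memory?
ratio = LATENCY_NS["read_1mb_hdd"] / LATENCY_NS["read_1mb_memory"]
print(f"HDD is ~{ratio:.0f}x slower than memory for a 1 MB sequential read")
```

Comparisons like this are exactly the order-of-magnitude intuition interviewers look for.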


Key Metrics for Capacity Estimation

Traffic Metrics

Daily Active Users (DAU)

  • The number of unique users who use your application in a single day

  • Directly impacts server load and resource requirements

  • Formula: Total unique users accessing the system per day

Queries Per Second (QPS)

  • The number of requests your system processes every second

  • Critical for determining server capacity and load balancing needs

  • Formula: Total daily requests ÷ 86,400 seconds

Peak QPS

  • Maximum number of queries during peak hours

  • Typically 2-5x higher than average QPS

  • Formula: Average QPS × Peak multiplier (usually 2-3x)

Storage Metrics

Data Volume

  • Total amount of data to be stored

  • Includes user data, metadata, and system logs

  • Consider data growth over time (typically 5-10 years)

Storage Requirements

  • Raw storage + replication factor + backup storage

  • Formula: Base storage × (1 + replication factor + backup factor)

Performance Metrics

Response Time

  • Time taken for the system to respond to a request

  • Critical for user experience

  • Target: < 200ms for most applications

Throughput

  • Amount of data processed per unit time

  • Formula: Number of operations ÷ Time period

The 5-Step Framework for Capacity Estimation

Follow these steps to break down any capacity estimation problem into manageable chunks.

Step 1: Gather Requirements and State Your Assumptions

This is the most critical step. You must clarify the scope with your interviewer. Don't be afraid to ask questions and state your assumptions out loud.

  • Users: How many users will the system have? Daily Active Users (DAU) is a great starting point. (e.g., "Let's assume 10 million DAU.")

  • Usage Patterns: How will users interact with the system?

    • How many posts/uploads/requests per user per day?

    • What is the read-to-write ratio? (e.g., a social media feed is read-heavy, with a 100:1 read/write ratio).

  • Data: What kind of data are we storing? What's the average size of an object (image, video, text)?

  • Growth: What's the expected data and user growth rate? (e.g., "Let's plan for 5 years of growth at 20% per year.")

💡 Pro Tip: Always use round, simple numbers. It makes the math easier. Saying "10 million DAU" is better than "11,257,000 DAU."

Step 2: Estimate Traffic (Queries Per Second)

Next, convert user activity into requests per second that your servers must handle.

Formulas:

  • Average Queries Per Second (QPS):

    Average QPS = (Daily Active Users × Requests per User per Day) ÷ 86,400

  • Peak Queries Per Second (QPS): Traffic is never evenly distributed. There will be peak hours.

    Peak QPS = Average QPS × Peak Multiplier (usually 2-3x)

Example:

Let's say we have a system with 10 million DAU, and each user makes 50 requests per day.

  • Average QPS = (10,000,000 × 50) ÷ 86,400 = 500,000,000 ÷ 86,400 ≈ 5,800 QPS

  • Peak QPS = 5,800 × 2.5 = 14,500 QPS

You'll also need to break this down by reads and writes using your assumed ratio. If it's a 10:1 read/write ratio:

  • Peak Read QPS = 14,500 × (10/11) ≈ 13,180 QPS

  • Peak Write QPS = 14,500 × (1/11) ≈ 1,320 QPS
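The arithmetic above can be bundled into a small helper. This is a sketch using this example's assumptions (a 2.5x peak multiplier and a 10:1 read/write ratio); the function name and defaults are illustrative:

```python
SECONDS_PER_DAY = 86_400

def estimate_qps(dau: int, requests_per_user: int,
                 peak_multiplier: float = 2.5, read_write_ratio: int = 10):
    """Return (average, peak, peak_read, peak_write) QPS estimates."""
    average = dau * requests_per_user / SECONDS_PER_DAY
    peak = average * peak_multiplier
    # Split peak traffic using the assumed read:write ratio (e.g., 10:1
    # means 10/11 of requests are reads and 1/11 are writes).
    peak_read = peak * read_write_ratio / (read_write_ratio + 1)
    peak_write = peak / (read_write_ratio + 1)
    return average, peak, peak_read, peak_write

avg, peak, reads, writes = estimate_qps(10_000_000, 50)
print(f"avg ~{avg:,.0f} QPS, peak ~{peak:,.0f} QPS "
      f"({reads:,.0f} reads / {writes:,.0f} writes)")
```

The exact output (~5,787 average QPS) differs slightly from the rounded figures above; in an interview, rounding to 5,800 is the right move.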

Step 3: Estimate Storage and Bandwidth

Now, figure out how much data you'll need to store and how much data will be moving through your network.

Storage Formulas:

  • Data per Day:

    Daily Storage = (Number of Writes per Day) × (Average Object Size)

  • Total Storage (for N years):

    Total Storage = (Daily Storage × 365 × Years) × Replication Factor

    Don't forget a replication factor! Data is usually replicated 3 times for durability.

Bandwidth Formulas:

  • Egress (Read) Bandwidth:

    Read Bandwidth = Read QPS × Average Response Size

  • Ingress (Write) Bandwidth:

    Write Bandwidth = Write QPS × Average Request Size

Example:

Let's say our 10 million users write 1 post per day, and each post is 1 KB.

  • Daily Storage = 10,000,000 × 1 KB = 10 GB

  • 5-Year Storage (with 3x replication) = (10 GB × 365 × 5) × 3 = 18,250 GB × 3 ≈ 55 TB

  • Read Bandwidth (Egress) = 13,180 QPS × 1 KB/response ≈ 13.2 MB/s

  • Write Bandwidth (Ingress) = 1,320 QPS × 1 KB/request ≈ 1.3 MB/s

A second, standalone bandwidth example, this time adding a 20% network overhead buffer:

Read requests: 50,000 QPS
Write requests: 500 QPS
Average response size: 2 KB
Average request size: 1 KB
Read bandwidth = 50,000 × 2 KB = 100 MB/s
Write bandwidth = 500 × 1 KB = 0.5 MB/s
Total with 20% overhead = (100 + 0.5) × 1.2 = 120.6 MB/s
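The storage formula above can be sketched the same way. This assumes decimal units (1 KB = 1,000 bytes) and the 3x replication factor from the example; the function name is illustrative:

```python
def estimate_storage_bytes(writes_per_day: int, object_size_bytes: int,
                           years: int = 5, replication: int = 3) -> int:
    """Total storage after `years`, including the replication factor."""
    daily_bytes = writes_per_day * object_size_bytes
    return daily_bytes * 365 * years * replication

# 10M posts/day at 1 KB each, matching the example above.
total = estimate_storage_bytes(10_000_000, 1_000)
print(f"~{total / 1e12:.0f} TB over 5 years with 3x replication")
```

This reproduces the ~55 TB figure from the worked example.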

Step 4: Estimate Memory (Cache)

Caching is crucial for performance. The 80/20 rule (or Pareto principle) is your best friend here: 20% of the data generates 80% of the traffic. This "hot" data is what you want to cache.

Cache Formula:

Cache Size = Hot Data Percentage × Total Data Size
Memory per Server = Cache Size + Application Memory + OS Overhead

Example:

Using our storage estimate of 55 TB, we want to cache the "hot" data.

  • Cache Size = 55 TB × 0.20 = 11 TB

A second, standalone example with different assumptions (10 TB of data, 25% of it hot, spread over 10 cache servers):

Total data: 10 TB
Hot data: 25%
Cache size = 10 TB × 0.25 = 2.5 TB
Per server cache (10 cache servers) = 2.5 TB ÷ 10 = 250 GB
Application memory per server = 50 GB
Total memory per server = 250 + 50 = 300 GB

So, you'd need a total of 11 TB of RAM for your cache cluster (e.g., Redis or Memcached).
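The cache-sizing steps can be sketched as a helper. The function name and the 50 GB application-memory default are illustrative; decimal units (1 TB = 1,000 GB) are assumed:

```python
def cache_plan(total_data_tb: float, hot_fraction: float,
               servers: int, app_memory_gb: float = 50):
    """Return (cluster cache size in TB, per-server memory in GB)."""
    cache_tb = total_data_tb * hot_fraction
    per_server_cache_gb = cache_tb * 1_000 / servers  # 1 TB = 1,000 GB
    return cache_tb, per_server_cache_gb + app_memory_gb

# The standalone example above: 10 TB total, 25% hot, 10 cache servers.
cache_tb, mem_gb = cache_plan(10, 0.25, servers=10)
print(f"cluster cache: {cache_tb} TB, per-server memory: {mem_gb:.0f} GB")
```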

Step 5: Estimate Servers

Finally, calculate how many servers you need to handle the load.

Server Formula:

Number of Servers = Peak QPS ÷ QPS per Server
Consider 30% headroom for failures and maintenance
Final Server Count = Base Server Count × 1.3

You need to make an assumption for "QPS per Server." A typical web server can handle 1,000 - 2,000 QPS, but this varies wildly.

Example:

Let's assume one application server can handle 1,000 QPS. Our peak QPS is 14,500.

  • Number of Servers = 14,500 ÷ 1,000 = 14.5 → round up to 15 servers

A second, standalone example with higher load:

Peak QPS: 60,000
QPS per server: 2,000
Base servers = 60,000 ÷ 2,000 = 30 servers
With redundancy (N+1) = 30 + 1 = 31 servers
For high availability (2x) = 31 × 2 = 62 servers

💡 Pro Tip: Always add a buffer for redundancy and failures. For N servers, you might provision N+1 or even 2N servers for high availability, so you might say "We need 15 servers, but I'd provision 30 across two data centers for redundancy."
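As a sketch, the rules of thumb above (30% headroom, then doubling for high availability) combine like this; the function name and defaults are illustrative:

```python
import math

def servers_needed(peak_qps: int, qps_per_server: int,
                   headroom: float = 0.3, ha_factor: int = 2) -> int:
    """Server count with headroom for failures, then scaled for HA."""
    base = math.ceil(peak_qps / qps_per_server)
    with_headroom = math.ceil(base * (1 + headroom))
    return with_headroom * ha_factor

# The example above: 14,500 peak QPS at 1,000 QPS per server.
print(servers_needed(14_500, 1_000))
```

With both buffers applied, the 15-server base grows to 40 provisioned servers, which is in line with the "provision 2N across two data centers" advice.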


Example 1 : Designing a URL Shortener

Let's apply the framework to a classic problem: TinyURL.

Step 1: Requirements & Assumptions

  • Writes: 100 million new URLs created per month.

  • Reads: 100:1 read-to-write ratio.

  • Data: URLs are max 100 characters. We'll store the original URL and the short hash.

  • Retention: Keep URLs forever. Plan for 5 years of growth.

Step 2: Traffic (QPS)

  • Write QPS:

    • 100M URLs/month ÷ (30 days × 86,400 s) ≈ 100,000,000 ÷ 2,600,000 ≈ 40 QPS

  • Read QPS:

    • 40 QPS × 100 = 4,000 QPS

  • Peak Read QPS (with 2x factor):

    • 4,000×2=8,000QPS4,000×2=8,000 QPS

Step 3: Storage & Bandwidth

  • 5-Year Storage:

    • URLs to store = 100M/month × 12 months × 5 years = 6 billion URLs

    • Storage per URL = 100 chars for the original URL + 8 chars for the hash ≈ 110 bytes. Let's round up to 500 bytes for metadata, user ID, etc.

    • Total Storage = 6B × 500 bytes = 3 TB

    • With 3x replication = 3 TB × 3 = 9 TB

  • Bandwidth:

    • Read Bandwidth = 8,000 QPS × 500 bytes = 4 MB/s (This is very low!)

Step 4: Memory (Cache)

  • Cache Size (20% of hot data):

    • 9 TB × 0.20 = 1.8 TB

Step 5: Servers

  • Application Servers (1,000 QPS per server):

    • Servers Needed = 8,000 (Peak Read QPS) ÷ 1,000 = 8 servers

    • With redundancy, let's say 10-12 servers.
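The whole URL-shortener estimate fits in a few lines of Python. Every number below is an assumption stated in the steps above; the variable names are illustrative:

```python
SECONDS_PER_MONTH = 30 * 86_400          # ~2.6M seconds

# Step 1: assumptions
urls_per_month = 100_000_000
read_write_ratio = 100
bytes_per_record = 500                   # URL + hash + metadata
years, replication = 5, 3

# Step 2: traffic (2x peak factor from the steps above)
write_qps = urls_per_month / SECONDS_PER_MONTH
peak_read_qps = write_qps * read_write_ratio * 2

# Step 3: storage, including replication (decimal units: 1 TB = 1e12 bytes)
total_urls = urls_per_month * 12 * years
storage_tb = total_urls * bytes_per_record * replication / 1e12

print(f"write ~{write_qps:.0f} QPS, peak read ~{peak_read_qps:.0f} QPS, "
      f"storage ~{storage_tb:.0f} TB")
```

Running this confirms the hand-rounded numbers: roughly 40 write QPS, roughly 8,000 peak read QPS, and 9 TB of replicated storage.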


Example 2: Social Media Platform

Requirements:

  • 500 million DAU

  • 10 posts per user per day

  • Average post size: 300 bytes

  • 20% of posts have images (200 KB average)

Calculations:

Traffic:
- Posts per day = 500M × 10 = 5B posts
- Write QPS = 5B ÷ 86,400 = 57,870 QPS
- Read QPS (assuming 50:1 ratio) = 57,870 × 50 = 2,893,500 QPS

Storage per day:
- Text data = 5B × 300 bytes = 1.5 TB
- Image data = 5B × 0.20 × 200 KB = 200 TB
- Total daily storage = 201.5 TB

5-year storage:
- 201.5 TB × 365 days × 5 years = 367,737 TB ≈ 368 PB

Bandwidth:
- Read bandwidth = 2,893,500 × 1 KB = 2.9 GB/s
- Write bandwidth = 57,870 × 2 KB = 115 MB/s
- Total bandwidth = 3.015 GB/s

Servers:
- Application servers = 2,893,500 ÷ 5,000 = 579 servers
- Database servers = 368 PB ÷ 10 TB per server = 36,800 servers
- Cache servers = 368 PB × 0.20 ÷ 1 TB per server = 73,600 servers

Advanced Estimation Techniques

Geographic Distribution

When designing global systems, consider:

  • Latency requirements: Users expect <100ms response times

  • Data locality: Store data close to users

  • Compliance: Data residency requirements

  • Disaster recovery: Multi-region redundancy

Formula for regional distribution:

Regional Traffic = Total Traffic × Regional User Percentage
Regional Storage = Total Storage × Data Locality Factor

Seasonal and Peak Load Patterns

Account for traffic variations:

  • Daily peaks: 2-3x average traffic

  • Weekly patterns: Weekend vs. weekday differences

  • Seasonal spikes: Holiday shopping, events

  • Viral effects: Sudden traffic surges

Peak planning formula:

Peak Capacity = Base Capacity × Peak Multiplier × Safety Factor
Safety Factor = 1.5 to 2.0 for critical systems
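As a quick sketch of the peak-planning formula (the 10,000 QPS base, 3x multiplier, and 1.5 safety factor are hypothetical inputs):

```python
def peak_capacity(base_qps: float, peak_multiplier: float = 3.0,
                  safety_factor: float = 1.5) -> float:
    """Capacity to provision for: base load scaled for peaks plus a safety margin."""
    return base_qps * peak_multiplier * safety_factor

# A hypothetical service averaging 10,000 QPS.
print(f"provision for ~{peak_capacity(10_000):,.0f} QPS")
```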

Cost Optimization Strategies

Resource Scaling:

  • Vertical scaling: Adding more power to existing servers

  • Horizontal scaling: Adding more servers

  • Auto-scaling: Dynamic resource allocation

Cost-Performance Trade-offs:

Cost per QPS = (Server Cost + Operational Cost) ÷ QPS Capacity
Total Cost of Ownership = Hardware + Software + Operations + Maintenance

Common Pitfalls and Best Practices

Pitfalls to Avoid

  1. Underestimating growth: Not accounting for viral growth or success

  2. Ignoring peak loads: Planning only for average traffic

  3. Overlooking redundancy: Not planning for failures

  4. Unrealistic assumptions: Using overly optimistic performance numbers

  5. Ignoring operational overhead: Not accounting for monitoring, logging, backups

Best Practices

  1. Start with requirements: Always begin with clear functional and non-functional requirements

  2. Use round numbers: Simplify calculations with approximations

  3. Document assumptions: Make your assumptions explicit and validate them

  4. Plan for failure: Include redundancy and disaster recovery

  5. Monitor and adjust: Continuously validate estimates against real-world performance

  6. Consider the full stack: Don't forget about load balancers, CDNs, and monitoring systems

Tools and Resources

Calculation Tools

  • Spreadsheet templates: For complex multi-variable calculations

  • Online calculators: AWS Calculator, Google Cloud Pricing Calculator

  • Capacity planning software: Specialized tools for enterprise environments

Performance Benchmarks

  • Database performance: 1,000-10,000 QPS per server

  • Web server capacity: 10,000-50,000 concurrent connections

  • Network latency: 1ms datacenter, 50ms cross-region, 150ms global

  • Storage IOPS: SSD 1,000-10,000, HDD 100-200

Don't Forget to Sanity-Check Your Estimates! ✅

After you've done the math, take a step back and ask if the numbers make sense. This shows senior-level thinking.

  • Order-of-Magnitude Check: If you calculate that a simple photo-sharing app needs exabytes of storage, you've likely made a mistake. Does the result feel right?

  • Dimensional Analysis: Did you mix up bits and bytes? Megabytes and Gigabytes? This is a common error that can throw your estimate off by a factor of 8x or 1000x.

  • Bottleneck Identification: Did one resource estimate come out wildly larger than the others? If you need 10,000 database servers but only 10 application servers, your data storage/access pattern might be flawed.

  • Compare to Known Systems: Compare your numbers to public data. WhatsApp handles over 100 billion messages a day. Does your messaging app estimate seem reasonable in comparison?
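For the dimensional-analysis pitfall in particular, a tiny formatter helps sanity-check magnitudes; this sketch assumes decimal units (1 KB = 1,000 bytes), and the function name is illustrative:

```python
def human_bytes(n: float) -> str:
    """Format a byte count using decimal units (1 KB = 1,000 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if n < 1000:
            return f"{n:.1f} {unit}"
        n /= 1000
    return f"{n:.1f} EB"

# The Example 2 storage figure; misreading bits as bytes would skew this by 8x.
print(human_bytes(368e15))
```

Printing intermediate results this way makes a bits-versus-bytes or MB-versus-GB slip obvious before it propagates through the whole estimate.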

Conclusion

Capacity estimation is both an art and a science that requires technical knowledge, practical experience, and sound judgment. The key is to start with clear requirements, make reasonable assumptions, and use systematic calculations to arrive at resource estimates.

Remember that these are estimates, not precise predictions. The goal is to provide a reasonable baseline for system design decisions while building in appropriate safety margins for growth and unexpected load patterns.