📑 Table of Contents
- Problem Understanding
- Requirements & Scope
- High-Level Design
- Hash Function Deep Dive
- Detailed Design
- Advanced Considerations
- Reference Materials
1. Problem Understanding
🧠 What is a URL Shortener?
A URL shortener takes a long URL and creates a much shorter alias that redirects to the original URL. When users click the short URL, they are redirected to the original long URL.
Example:
- Original URL:
https://www.systeminterview.com/q=chatsystem&c=loggedin&v=v3&l=long - Shortened URL:
https://tinyurl.com/y7keocwj
Core Operations:
- URL Shortening: Given a long URL → return a shorter URL
- URL Redirecting: Given a shorter URL → redirect to the original URL
Why URL Shorteners?
- Space constraints: Twitter’s character limit
- Cleaner appearance: Easier to share and remember
- Analytics: Track click rates and user behavior
- Hide original URLs: Useful for affiliate links
2. Requirements & Scope
🎯 Key Clarifying Questions
Basic Functionality:
- How does the URL shortener work exactly?
- What is the expected traffic volume? (100 million URLs generated per day)
- How long should the shortened URL be? (As short as possible)
- What characters are allowed? (0-9, a-z, A-Z)
- Can shortened URLs be deleted or updated? (For simplicity, assume no)
Advanced Features:
- Do we need custom aliases?
- Should URLs expire after some time?
- Do we need analytics and click tracking?
- Should we detect and prevent spam?
Scale Considerations:
- What’s the read to write ratio? (10:1 - more reads than writes)
- How long should the service run? (10 years)
- Do we need high availability?
📋 Functional Requirements
- URL Shortening: Generate short URLs from long URLs
- URL Redirecting: Redirect short URLs to original URLs
- High Availability: System should be fault-tolerant
- Custom URLs: Support custom short URLs (optional)
📊 Non-Functional Requirements
- Scale: 100 million URLs generated daily
- Availability: 99.9% uptime
- Latency: Low latency for redirecting
- Shortness: URLs should be as short as possible
📐 Back-of-the-Envelope Estimation
Traffic Analysis:
- Write operations: 100 million URLs/day = 1,160/second
- Read operations: 1,160 × 10 = 11,600/second (10:1 ratio)
- Peak QPS: Assume 2× average load = 23,200/second
- Storage for 10 years: 100M × 365 × 10 = 365 billion records
- Average URL length: 100 bytes
- Storage size: 365B × 100 bytes = 365 TB
Bandwidth Estimation:
- Write: 1,160 × 500 bytes = 580 KB/s
- Read: 11,600 × 500 bytes = 5.8 MB/s
Memory Estimation:
- Cache 20% of daily read volume: 2B × 500 bytes = 1 TB cache
3. High-Level Design
🏗️ System Architecture
[Client] → [Web Server] → [Load Balancer] → [URL Shortening Service]
↓
[Database] ← [Cache]
🔌 API Design
1. URL Shortening
POST /api/v1/data/shorten
Request: {longUrl: longURLString}
Response: shortURL
2. URL Redirecting
GET /api/v1/shortUrl
Response: longURL for HTTP redirection
🔄 URL Redirecting Flow
301 vs 302 Redirect:
301 Redirect (Permanent):
- Browser caches the response
- Subsequent requests go directly to long URL
- ✅ Reduces server load
- ❌ Difficult to track analytics
302 Redirect (Temporary):
- All requests go through URL shortener
- ✅ Better for analytics and tracking
- ❌ Higher server load
Implementation Approach:
Use hash table to store <shortURL, longURL> mappings:
- Get longURL:
longURL = hashTable.get(shortURL) - Perform redirect with appropriate HTTP status code
🎯 URL Shortening Flow
High-Level Process:
- User submits long URL
- Check if URL already exists in database
- If exists, return existing short URL
- If not, generate new short URL using hash function
- Store mapping in database
- Return short URL to user
4. Hash Function Deep Dive
🔢 Hash Value Length
Character Set: [0-9, a-z, A-Z] = 62 possible characters
Length Analysis:
- To support 365 billion URLs, find smallest n where 62^n ≥ 365 billion
- When n = 7: 62^7 ≈ 3.5 trillion (sufficient)
- Result: Use 7-character hash values
🛠️ Hash Function Approaches
Approach 1: Hash + Collision Resolution
Method:
- Use well-known hash functions (CRC32, MD5, SHA-1)
- Take first 7 characters of hash result
- Handle collisions by appending predefined string and rehashing
Example:
URL: https://en.wikipedia.org/wiki/Systems_design
CRC32: 5cb54054 (take first 7: 5cb5405)
MD5: 5a62509c (take first 7: 5a62509)
SHA-1: 0fddc8e7b (take first 7: 0fddc8e)
Collision Resolution Process:
- Hash the URL, take first 7 characters
- Check if shortURL exists in database
- If collision detected, append predefined string and rehash
- Repeat until unique shortURL found
Optimization - Bloom Filters:
- Use bloom filters to quickly check existence before database query
- Space-efficient probabilistic data structure
- Reduces expensive database lookups for collision detection
Pros:
- ✅ Fixed output length
- ✅ Well-tested algorithms
- ✅ No dependency on external ID generator
Cons:
- ❌ Collision handling needed
- ❌ Expensive database queries for collision checking
- ❌ Non-deterministic performance due to collisions
Approach 2: Base 62 Conversion
Method:
- Use unique ID generator to create unique numbers
- Convert numbers to base 62 representation
- Character mapping: 0-9 → 0-9, 10-35 → a-z, 36-61 → A-Z
Detailed Algorithm:
Base 62 conversion process for 11157₁₀:
Step 1: 11157 ÷ 62 = 179 remainder 59 → 'X' (59 maps to X)
Step 2: 179 ÷ 62 = 2 remainder 55 → 'T' (55 maps to T)
Step 3: 2 ÷ 62 = 0 remainder 2 → '2' (2 maps to 2)
Reading backwards: 2TX
Result: https://tinyurl.com/2TX
Character Mapping Table:
0-9: 0,1,2,3,4,5,6,7,8,9
10-35: a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
36-61: A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z
Dependency: Unique ID Generator
Requires distributed unique ID generator (covered in Chapter 7):
- Auto-increment with different starting points
- UUID (but too long for our case)
- Ticket servers
- Twitter Snowflake (recommended)
Pros:
- ✅ No collisions (when using unique IDs)
- ✅ Simple and predictable
- ✅ Easy to implement
- ✅ Deterministic performance
Cons:
- ❌ Requires unique ID generator service
- ❌ Sequential IDs might be predictable
- ❌ Dependency on external service
⚖️ Comparison Summary
| Aspect | Hash + Collision | Base 62 Conversion |
|---|---|---|
| Shortness | ✅ Fixed length | ✅ Fixed length |
| Uniqueness | ❌ Requires collision handling | ✅ Guaranteed unique |
| Performance | ❌ DB queries needed | ✅ Fast computation |
| Predictability | ✅ Less predictable | ❌ Sequential patterns |
| Implementation | ❌ More complex | ✅ Simpler |
Recommendation: Use Base 62 conversion with unique ID generator
5. Detailed Design
🗄️ Data Model
Database Schema:
Table: url_mappings
- id: BIGINT (Primary Key)
- shortURL: VARCHAR(7) (Unique Index)
- longURL: TEXT
- createdAt: TIMESTAMP
- expiredAt: TIMESTAMP (NULL for permanent URLs)
Why Database over Hash Table:
- Hash tables require all data in memory (expensive)
- Database provides persistence and ACID properties
- Better for handling large datasets (365 billion records)
� URL Encoding Deep Dive
Base 62 Encoding Implementation:
Base 62 uses characters: [0-9a-zA-Z]
0 → 0, 1 → 1, ..., 9 → 9
10 → a, 11 → b, ..., 35 → z
36 → A, 37 → B, ..., 61 → Z
Algorithm Steps:
- Start with unique integer ID (e.g., 2009215674938)
- Convert to base 62:
- 2009215674938 % 62 = 26 → ‘q’
- 2009215674938 / 62 = 32406382338
- 32406382338 % 62 = 52 → ‘Q’
- Continue until quotient is 0
- Reverse the result: “Qq…” → final short URL
�🔄 URL Shortening Detailed Flow
Step-by-Step Process:
- Input Validation: Check if longURL is valid format
- Duplicate Check: Query database for existing longURL
- Return if Found: If exists, return cached shortURL
- Generate ID: Create unique ID using distributed ID generator
- Encode: Transform ID to shortURL using base 62
- Store: Save (ID, shortURL, longURL, timestamps) in database
- Cache Update: Store mapping in cache for future reads
- Return: Send shortURL back to client
Example Walkthrough:
Input: https://en.wikipedia.org/wiki/Systems_design
Unique ID: 2009215674938
Base 62 conversion: 2009215674938 → "zn9edcu"
Result: https://tinyurl.com/zn9edcu
🚀 URL Redirecting Detailed Flow
With Caching Optimization:
- Input: User clicks shortURL
- Load Balancer: Routes request to web server
- Cache Check: Look for shortURL in cache first
- Cache Hit: If found, redirect immediately
- Cache Miss: Query database for longURL
- Store in Cache: Cache the result for future requests
- Redirect: Return longURL with HTTP redirect status
Performance Benefits:
- Cache reduces database load
- Faster response times for popular URLs
- Better user experience
6. Advanced Considerations
� Unique ID Generator Options
Option 1: Multi-master replication
- Use auto_increment with different starting points
- Server 1: 1, 3, 5, 7, 9, …
- Server 2: 2, 4, 6, 8, 10, …
- Hard to scale beyond small number of servers
Option 2: UUID (Universally Unique Identifier)
- 128-bit number, very low collision probability
- Too long for our use case (would need >7 characters)
Option 3: Ticket Server
- Centralized auto-increment ID generator
- Single point of failure issue
- Can use multiple ticket servers for redundancy
Option 4: Twitter Snowflake
- 64-bit IDs with timestamp, datacenter ID, machine ID, sequence
- Recommended approach for distributed systems
�🛡️ Rate Limiting
Problem: Malicious users might overwhelm system with requests
Solution: Implement rate limiting based on:
- IP address (e.g., 100 requests per hour per IP)
- User ID (for authenticated users)
- API key (for enterprise clients)
📊 Analytics
Tracking Capabilities:
- Click-through rates
- Geographic distribution of clicks
- Time-based usage patterns
- Referrer information
- Device and browser statistics
Implementation:
- Use 302 redirects to track each click
- Store analytics data in separate service
- Aggregate data for real-time dashboards
🔍 Web Server Scaling
Horizontal Scaling:
- Web tier is stateless → easy to add/remove servers
- Use load balancers to distribute traffic
- Auto-scaling based on traffic patterns
🗄️ Database Scaling
Techniques:
- Read Replicas: Handle read-heavy workload (10:1 read/write ratio)
- Sharding: Distribute data across multiple databases
- Partitioning: Split by creation time or hash of shortURL
Sharding Strategy:
- Consistent Hashing: Distribute URLs based on shortURL hash
- Range-based: Partition by creation time (e.g., monthly)
- Directory-based: Maintain lookup service for shard location
🔒 Security Considerations
Threats and Mitigations:
- Spam URLs: Implement URL validation and blacklists
- Malicious Content: Scan URLs for malware/phishing
- Rate Limiting: Prevent DDoS attacks
- Input Validation: Sanitize user inputs
- HTTPS Enforcement: Secure data transmission
📈 Availability & Reliability
High Availability Strategies:
- Multiple Data Centers: Geographic distribution
- Failover Mechanisms: Automatic switching to backup systems
- Health Monitoring: Real-time system monitoring
- Circuit Breakers: Prevent cascade failures
- Backup and Recovery: Regular database backups
⏱️ URL Expiration
Expiration Strategies:
- User-defined: Allow users to set expiration time
- Default Policy: URLs expire after certain period (e.g., 2 years)
- Cleanup Process: Background job to remove expired URLs
- Lazy Deletion: Mark as expired, clean up during read
✅ Interview Wrap-up Topics
🎯 Additional Discussion Points
If there’s extra time in the interview, consider these advanced topics:
Rate Limiting:
- Malicious users might send overwhelming URL shortening requests
- Implement IP-based or user-based rate limiting
- Protects against abuse and ensures fair usage
Web Server Scaling:
- Web tier is stateless → easy horizontal scaling
- Add/remove servers behind load balancer
- Auto-scaling based on traffic patterns
Database Scaling:
- Replication: Master-slave setup for read replicas
- Sharding: Distribute data across multiple databases
- Partitioning: Split by time ranges or hash-based
Analytics Integration:
- Track click patterns and user behavior
- Questions to answer: How many clicks? When? From where?
- Store analytics data in separate service for performance
- Real-time dashboards and reporting
Availability, Consistency, and Reliability:
- Core concepts for any large-scale system
- Trade-offs between consistency and availability (CAP theorem)
- Failure handling and disaster recovery planning
- Monitoring and alerting for system health
7. Reference Materials
System Design Fundamentals
- RESTful API Design: RESTful Tutorial
- Bloom Filters: Wikipedia - for efficient duplicate detection
- Database Sharding: Techniques for horizontal scaling
- Caching Strategies: Redis/Memcached patterns
- Consistent Hashing: Design Consistent Hashing
Industry Examples
- TinyURL: Original URL shortening service
- Bit.ly: Advanced analytics and custom domains
- Google URL Shortener (discontinued): Lessons learned
- Twitter’s t.co: Integrated with social media platform
Technical Deep Dives
- Base Conversion Algorithms: Mathematical foundations
- Unique ID Generation: Distributed system challenges (Snowflake algorithm)
- Hash Function Analysis: Performance and collision characteristics
- Load Balancing: Strategies for high-traffic systems
Distributed Systems Concepts
- CAP Theorem: Consistency, Availability, Partition tolerance trade-offs
- ACID Properties: Database transaction guarantees
- Eventually Consistent: Trade-offs in distributed caching
- Circuit Breaker Pattern: Preventing cascade failures
Best Practices
- API Rate Limiting: Preventing abuse and ensuring fair usage
- Database Optimization: Indexing strategies and query performance
- Security: Input validation, threat prevention, and data protection
- Monitoring: Metrics collection, alerting, and system observability