📑 Table of Contents
- Problem Understanding
- Requirements & Scope
- High-Level Design
- Deep Dive
- Advanced Considerations
- Reference Materials
1. Problem Understanding
🧠 What is a Web Crawler?
A web crawler (also known as spider or robot) is a system that systematically browses the web to discover and download content. It starts with seed URLs and follows links to find new pages.
Basic Algorithm:
- Start with a set of seed URLs
- Download web pages from these URLs
- Extract new URLs from downloaded pages
- Add new URLs to the crawling queue
- Repeat the process
Common Use Cases:
- Search Engine Indexing: Google, Bing crawlers for search results
- Web Archiving: Preserving web content for historical purposes
- Web Mining: Data extraction for business intelligence
- Web Monitoring: Copyright and trademark infringement detection
🎯 Key Challenges
- Scale: Billions of web pages exist
- Politeness: Respect website rate limits
- Robustness: Handle bad HTML, server failures, malicious content
- Freshness: Detect updated content
- Duplicate Detection: Avoid storing same content multiple times
2. Requirements & Scope
🎯 Key Clarifying Questions
Purpose and Scale:
- What is the main purpose (search indexing, archiving, monitoring)?
- How many web pages to crawl per month?
- What content types to support (HTML, PDFs, images)?
- Should we crawl newly added/updated pages?
- How long should we store crawled content?
Content Handling:
- How to handle duplicate content?
- Do we need to respect robots.txt?
- Should we support JavaScript-rendered pages?
- What about authentication-protected content?
Performance and Behavior:
- What’s the acceptable crawling rate per website?
- How to handle crawling failures and retries?
- Do we need real-time crawling or batch processing?
- What about geographic distribution of crawlers?
📋 Functional Requirements
- Content Discovery: Crawl and download web pages
- Link Extraction: Extract URLs from HTML content
- Duplicate Detection: Avoid processing same content
- Content Storage: Store crawled pages for specified duration
- Politeness: Respect website crawling policies
- Freshness: Re-crawl pages to detect updates
📊 Non-Functional Requirements
- Scalability: Handle billions of web pages
- Robustness: Gracefully handle failures and edge cases
- Politeness: Don’t overwhelm target websites
- Extensibility: Easy to add new content types
- Performance: Efficient parallel processing
📐 Back-of-the-Envelope Estimation
Traffic and Scale:
- Pages per month: 1 billion pages
- QPS: 1B / (30 days × 24 hours × 3600 seconds) ≈ 400 pages/second
- Peak QPS: 2 × 400 = 800 pages/second
- Average page size: 500KB
- Monthly storage: 1B × 500KB = 500TB
- 5-year storage: 500TB × 12 × 5 = 30PB
3. High-Level Design
🏗️ System Architecture
[Seed URLs] → [URL Frontier] → [HTML Downloader] → [DNS Resolver]
↑ ↓
[URL Storage] ← [URL Seen?] ← [URL Filter] ← [URL Extractor]
↓ ↑
[Content Storage] ← [Content Seen?] ← [Content Parser]
🔧 Core Components
1. Seed URLs
Starting points for the crawling process:
# Example seed URL strategies
geographic_seeds = [
"https://en.wikipedia.org", # Global
"https://baidu.com", # China
"https://yahoo.co.jp" # Japan
]
topic_seeds = [
"https://amazon.com", # E-commerce
"https://stackoverflow.com", # Technology
"https://reuters.com" # News
]
2. URL Frontier
FIFO queue storing URLs to be downloaded:
class URLFrontier:
def __init__(self):
self.queue = deque()
self.seen_urls = set()
def add_url(self, url):
if url not in self.seen_urls:
self.queue.append(url)
self.seen_urls.add(url)
def get_url(self):
return self.queue.popleft() if self.queue else None
3. HTML Downloader
Downloads web pages with proper error handling:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
class HTMLDownloader:
def __init__(self):
self.session = requests.Session()
retry_strategy = Retry(
total=3,
backoff_factor=1,
status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
self.session.mount("http://", adapter)
self.session.mount("https://", adapter)
def download(self, url):
try:
response = self.session.get(url, timeout=10)
response.raise_for_status()
return response.text
except Exception as e:
logging.error(f"Failed to download {url}: {e}")
return None
4. Content Parser
Validates and parses HTML content:
from bs4 import BeautifulSoup
import html5lib
class ContentParser:
def parse(self, html_content):
try:
soup = BeautifulSoup(html_content, 'html5lib')
return soup
except Exception as e:
logging.error(f"Failed to parse HTML: {e}")
return None
def is_valid_html(self, html_content):
try:
BeautifulSoup(html_content, 'html5lib')
return True
except:
return False
5. URL Extractor
Extracts and normalizes URLs from HTML:
from urllib.parse import urljoin, urlparse
import re
class URLExtractor:
def extract_urls(self, html_content, base_url):
soup = BeautifulSoup(html_content, 'html5lib')
urls = []
# Extract from anchor tags
for link in soup.find_all('a', href=True):
url = urljoin(base_url, link['href'])
if self.is_valid_url(url):
urls.append(url)
return urls
def is_valid_url(self, url):
parsed = urlparse(url)
return parsed.scheme in ['http', 'https'] and parsed.netloc
🔄 Crawling Workflow
Step-by-Step Process:
- Initialize: Add seed URLs to URL Frontier
- Fetch: HTML Downloader gets URLs from Frontier
- Resolve: DNS Resolver converts URLs to IP addresses
- Download: Fetch HTML content from web servers
- Parse: Content Parser validates and parses HTML
- Check Duplicates: Content Seen component checks for duplicates
- Store: Save unique content to Content Storage
- Extract Links: URL Extractor finds new URLs
- Filter: URL Filter removes unwanted URLs
- Deduplicate: URL Seen component removes duplicate URLs
- Queue: Add new URLs to URL Frontier
4. Deep Dive
🚦 URL Frontier Design
Politeness Requirements:
import time
from collections import defaultdict, deque
class PoliteFrontier:
def __init__(self):
self.host_queues = defaultdict(deque)
self.host_last_access = defaultdict(float)
self.min_delay = 1.0 # 1 second between requests
def add_url(self, url):
host = urlparse(url).netloc
self.host_queues[host].append(url)
def get_url(self):
current_time = time.time()
for host, queue in self.host_queues.items():
if queue and (current_time - self.host_last_access[host]) >= self.min_delay:
self.host_last_access[host] = current_time
return queue.popleft()
return None
Priority-Based Crawling:
import heapq
class PriorityFrontier:
def __init__(self):
self.priority_queue = []
self.counter = 0
def add_url(self, url, priority=1):
# Lower number = higher priority
heapq.heappush(self.priority_queue, (priority, self.counter, url))
self.counter += 1
def get_url(self):
if self.priority_queue:
_, _, url = heapq.heappop(self.priority_queue)
return url
return None
🔍 Duplicate Detection
Content Deduplication:
import hashlib
class ContentDeduplicator:
def __init__(self):
self.content_hashes = set()
def is_duplicate(self, content):
content_hash = hashlib.sha256(content.encode()).hexdigest()
if content_hash in self.content_hashes:
return True
self.content_hashes.add(content_hash)
return False
URL Deduplication with Bloom Filter:
import mmh3
from bitarray import bitarray
class BloomFilter:
def __init__(self, size=1000000, hash_count=3):
self.size = size
self.hash_count = hash_count
self.bit_array = bitarray(size)
self.bit_array.setall(0)
def add(self, item):
for i in range(self.hash_count):
index = mmh3.hash(item, i) % self.size
self.bit_array[index] = 1
def contains(self, item):
for i in range(self.hash_count):
index = mmh3.hash(item, i) % self.size
if self.bit_array[index] == 0:
return False
return True
🗄️ Storage Systems
Content Storage Architecture:
class ContentStorage:
def __init__(self):
self.disk_storage = DiskStorage()
self.cache = MemoryCache()
def store(self, url, content):
# Store on disk for persistence
self.disk_storage.save(url, content)
# Cache popular content
if self.is_popular(url):
self.cache.set(url, content)
def retrieve(self, url):
# Try cache first
content = self.cache.get(url)
if content:
return content
# Fallback to disk
return self.disk_storage.load(url)
🤖 Robots.txt Handling
import urllib.robotparser
class RobotsChecker:
def __init__(self):
self.robots_cache = {}
def can_crawl(self, url, user_agent="*"):
parsed_url = urlparse(url)
robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"
if robots_url not in self.robots_cache:
rp = urllib.robotparser.RobotFileParser()
rp.set_url(robots_url)
try:
rp.read()
self.robots_cache[robots_url] = rp
except:
return True # Allow if robots.txt unavailable
return self.robots_cache[robots_url].can_fetch(user_agent, url)
5. Advanced Considerations
🌐 Distributed Crawling
Consistent Hashing for URL Distribution:
import hashlib
class DistributedCrawler:
def __init__(self, crawler_nodes):
self.nodes = sorted(crawler_nodes)
def get_crawler_for_url(self, url):
url_hash = int(hashlib.md5(url.encode()).hexdigest(), 16)
node_index = url_hash % len(self.nodes)
return self.nodes[node_index]
def should_crawl_url(self, url, current_node):
assigned_node = self.get_crawler_for_url(url)
return assigned_node == current_node
📊 Performance Optimization
Connection Pooling:
import requests
from requests.adapters import HTTPAdapter
class OptimizedDownloader:
def __init__(self):
self.session = requests.Session()
adapter = HTTPAdapter(pool_connections=100, pool_maxsize=100)
self.session.mount('http://', adapter)
self.session.mount('https://', adapter)
Asynchronous Crawling:
import asyncio
import aiohttp
class AsyncCrawler:
def __init__(self, max_concurrent=100):
self.semaphore = asyncio.Semaphore(max_concurrent)
async def crawl_url(self, session, url):
async with self.semaphore:
try:
async with session.get(url) as response:
return await response.text()
except Exception as e:
logging.error(f"Failed to crawl {url}: {e}")
return None
async def crawl_urls(self, urls):
async with aiohttp.ClientSession() as session:
tasks = [self.crawl_url(session, url) for url in urls]
return await asyncio.gather(*tasks)
🔒 Security and Safety
URL Filtering:
import re
from urllib.parse import urlparse
class URLFilter:
def __init__(self):
self.blocked_domains = {'malware.com', 'spam.net'}
self.blocked_extensions = {'.exe', '.dmg', '.zip'}
self.max_url_length = 2048
def is_valid_url(self, url):
parsed = urlparse(url)
# Check domain blacklist
if parsed.netloc.lower() in self.blocked_domains:
return False
# Check file extensions
path = parsed.path.lower()
if any(path.endswith(ext) for ext in self.blocked_extensions):
return False
# Check URL length
if len(url) > self.max_url_length:
return False
return True
📈 Monitoring and Analytics
Crawler Metrics:
import time
from collections import defaultdict
class CrawlerMetrics:
def __init__(self):
self.start_time = time.time()
self.pages_crawled = 0
self.errors = defaultdict(int)
self.response_times = []
def record_success(self, response_time):
self.pages_crawled += 1
self.response_times.append(response_time)
def record_error(self, error_type):
self.errors[error_type] += 1
def get_stats(self):
elapsed = time.time() - self.start_time
return {
'pages_per_second': self.pages_crawled / elapsed,
'total_pages': self.pages_crawled,
'error_rate': sum(self.errors.values()) / (self.pages_crawled + sum(self.errors.values())),
'avg_response_time': sum(self.response_times) / len(self.response_times)
}
🔄 Freshness and Re-crawling
Priority-Based Refresh:
import time
from enum import Enum
class PageType(Enum):
NEWS = 1 # Re-crawl every hour
SOCIAL = 6 # Re-crawl every 6 hours
GENERAL = 24 # Re-crawl every day
ARCHIVE = 168 # Re-crawl every week
class FreshnesManager:
def __init__(self):
self.last_crawled = {}
self.page_types = {}
def should_recrawl(self, url):
if url not in self.last_crawled:
return True
page_type = self.page_types.get(url, PageType.GENERAL)
hours_since_crawl = (time.time() - self.last_crawled[url]) / 3600
return hours_since_crawl >= page_type.value
def mark_crawled(self, url, page_type=PageType.GENERAL):
self.last_crawled[url] = time.time()
self.page_types[url] = page_type
6. Reference Materials
Foundational Papers
- The Anatomy of a Large-Scale Hypertextual Web Search Engine
- Mercator: A Scalable, Extensible Web Crawler
- Web Crawling Survey
Industry Implementations
- Googlebot Documentation
- Apache Nutch - Open source web crawler
- Scrapy Framework - Python crawling framework