
URL Shortener System Design

Problem Description:

We need to design a URL shortening service like Bitly or TinyURL. The system must support the following core components:

  • URL shortening and unique short code generation
  • Redirection service
  • Analytics and click tracking

Functional Requirements:

Core Functions:

  1. URL Shortening and Generation

    • Convert a long URL into a short URL
    • Unique short code generation
    • Custom alias support
    • Expiration time setting
    • QR code generation
  2. Redirection Service

    • Redirect from the short URL to the original URL
    • Fast redirection (< 10ms)
    • 301 vs 302 redirect choice
    • Dead link handling
  3. Analytics and Tracking

    • Click count tracking
    • Geographic location tracking
    • Referrer tracking
    • Device/browser tracking
    • Time-series analytics

Non-Functional Requirements

Performance:

  • URL shortening latency < 100ms
  • Redirection latency < 10ms
  • 99.99% uptime availability
  • Handle 1000+ redirects/second

Scalability:

  • Billions of URLs
  • Millions of redirects/day
  • Global distribution
  • Horizontal scaling

Capacity Estimation

Assumptions:

  • 100 milyon active URLs
  • 10 milyon new URLs/month
  • 100:1 read-to-write ratio
  • Average URL size: 100 bytes
  • Short code length: 7 characters
  • Average expiration: 5 years

Storage:

  • URL pairs: 100M × (100 bytes + 20 bytes) = 12 GB
  • Analytics data: 100M × 1 KB = 100 GB
  • Total storage: ~112 GB (initial)
  • Growth (URL pairs): ~1.2 GB/month

QPS:

  • Write (URL creation): 10M/month / (30 × 24 × 3600) = ~4 QPS
  • Read (redirects): 4 QPS × 100 = 400 QPS
  • Peak QPS: ~2,000 QPS

Bandwidth:

  • Write: 4 QPS × 200 bytes = ~800 bytes/s
  • Read: 400 QPS × 200 bytes = ~80 KB/s
  • Total bandwidth: ~80 KB/s
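
The estimates above can be reproduced with a few lines of arithmetic; the following plain-Python sketch simply restates the assumptions listed earlier:

# Storage
urls = 100_000_000
record_size = 100 + 20                  # long URL + short code/metadata, in bytes
print(urls * record_size / 1e9)         # 12.0 GB of URL pairs
print(urls * 1_000 / 1e9)               # 100 GB of analytics data

# QPS
new_urls_per_month = 10_000_000
seconds_per_month = 30 * 24 * 3600
write_qps = new_urls_per_month / seconds_per_month
print(write_qps)                        # ≈ 3.9 → ~4 QPS of writes
print(write_qps * 100)                  # ≈ 386 → ~400 QPS of redirects

# Bandwidth, assuming ~200 bytes per request/response
print(4 * 200)                          # ~800 bytes/s write
print(400 * 200 / 1e3)                  # ~80 KB/s read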

High-Level System Architecture

Core Component Design

1. Shortening Service

Responsibilities:

  • URL validation
  • Short code generation
  • URL mapping storage
  • Custom alias handling

Short Code Generation Approaches:

Approach 1: Hash-based (MD5/SHA256)

import hashlib

def generate_short_code_hash(long_url):
    # Generate an MD5 hash of the long URL
    digest = hashlib.md5(long_url.encode()).hexdigest()

    # Take the first 7 characters as the candidate code
    short_code = digest[:7]

    # On collision, append an incrementing counter and rehash
    # (exists() is assumed to check the code against the DB)
    counter = 1
    while exists(short_code):
        digest = hashlib.md5(f"{long_url}{counter}".encode()).hexdigest()
        short_code = digest[:7]
        counter += 1

    return short_code

Pros: Deterministic (same URL → same code)
Cons: Collision handling needed

Approach 2: Base62 Encoding

def generate_short_code_base62(id):
    # Base62 alphabet: [0-9a-zA-Z]
    chars = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    base = 62

    short_code = ""
    while id > 0:
        short_code = chars[id % base] + short_code
        id = id // base

    # Left-pad to the fixed 7-character length
    return short_code.rjust(7, '0')

# Example: ID 125 → "0000021" (7 chars)

Pros: No collision, sequential
Cons: Predictable

Approach 3: Random Generation (Recommended)

import secrets
import string

def generate_short_code_random():
    chars = string.ascii_letters + string.digits

    while True:
        # Cryptographically secure choice, so codes are not guessable
        short_code = ''.join(secrets.choice(chars) for _ in range(7))

        # Check uniqueness (exists() is assumed to query the DB)
        if not exists(short_code):
            return short_code
Pros: Unpredictable, secure
Cons: Requires uniqueness check

Key Space Calculation:

62^7 = 3,521,614,606,208 (~3.5 trillion)

With 100M URLs: collision probability ≈ 0.003%
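
A quick sanity check of these numbers in Python:

# Size of the 7-character Base62 key space
key_space = 62 ** 7
print(f"{key_space:,}")                  # 3,521,614,606,208 (~3.5 trillion)

# Fraction of the key space occupied by 100M URLs, i.e. roughly the chance
# that one freshly generated random code collides with an existing one
print(f"{100_000_000 / key_space:.5%}")  # 0.00284% (~0.003%)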

Database Schema (PostgreSQL):

urls:
- id (PK, SERIAL)
- short_code (UK, VARCHAR(7))
- long_url (TEXT)
- user_id (FK → users, nullable)
- custom_alias (VARCHAR(50), nullable)
- created_at (TIMESTAMP)
- expires_at (TIMESTAMP, nullable)
- is_active (BOOLEAN)
- click_count (BIGINT)

CREATE INDEX idx_short_code ON urls(short_code);
CREATE INDEX idx_user_id ON urls(user_id, created_at DESC);
CREATE INDEX idx_expires_at ON urls(expires_at) WHERE expires_at IS NOT NULL;

URL Shortening Flow:

API Endpoints:

POST /api/v1/shorten
Body: { long_url, custom_alias?, expires_at? }
Returns: { short_url, short_code }

GET /api/v1/urls/{short_code}/info
Returns: { long_url, created_at, click_count }

DELETE /api/v1/urls/{short_code}
Requires: Authentication
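
To make the shortening flow concrete, here is a minimal sketch of the POST /api/v1/shorten handler. It assumes Flask, a key_generator exposing the get_key() pool API described in section 3, and a db client with exists_short_code()/insert_url() helpers; these names are illustrative, not part of the original design.

from flask import Flask, request, jsonify

app = Flask(__name__)
BASE_URL = "https://sho.rt"  # assumed branded short domain

@app.route("/api/v1/shorten", methods=["POST"])
def shorten():
    body = request.get_json()
    long_url = body.get("long_url")
    custom_alias = body.get("custom_alias")
    expires_at = body.get("expires_at")

    # 1. Validate the long URL (scheme check; see Security Considerations)
    if not long_url or not long_url.startswith(("http://", "https://")):
        return jsonify({"error": "invalid long_url"}), 400

    # 2. Use the custom alias if it is free, otherwise take a pre-generated key
    if custom_alias:
        if db.exists_short_code(custom_alias):       # assumed DB helper
            return jsonify({"error": "alias already taken"}), 409
        short_code = custom_alias
    else:
        short_code = key_generator.get_key()         # key pool, section 3

    # 3. Persist the mapping into the urls table from the schema above
    db.insert_url(short_code=short_code, long_url=long_url, expires_at=expires_at)

    # 4. Return the short URL to the client
    return jsonify({"short_url": f"{BASE_URL}/{short_code}",
                    "short_code": short_code}), 201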

2. Redirection Service

Responsibilities:

  • Short code lookup
  • Redirect to original URL
  • Click tracking
  • Cache management

Redirection Flow:

Redis Cache Structure:

Key: url:{short_code}
Value: {
  "long_url": "https://example.com/very/long/path",
  "expires_at": "2025-12-31T23:59:59Z"
}
TTL: 24 hours

Key: url:clicks:{short_code}
Type: Counter
Value: click_count

301 vs 302 Redirect:

Feature    | 301 (Permanent)         | 302 (Temporary)
Caching    | Browsers cache          | No caching
SEO        | Link juice transferred  | No transfer
Analytics  | Limited tracking        | Full tracking
Use case   | Permanent URLs          | Analytics needed

Implementation:

import json
from flask import redirect, request  # redis, db, kafka and now() are assumed, as elsewhere

def handle_redirect(short_code):
    # 1. Check the cache first
    cached = redis.get(f"url:{short_code}")

    if cached:
        url_data = json.loads(cached)
    else:
        # Cache miss - query the DB
        url_data = db.query("SELECT * FROM urls WHERE short_code = ?", short_code)

        if not url_data:
            return "404 Not Found", 404

        # Cache the mapping for 24 hours
        redis.setex(f"url:{short_code}", 86400, json.dumps(url_data))

    # 2. Check expiration (applies to cached entries as well)
    if url_data.get("expires_at") and url_data["expires_at"] < now():
        return "410 Gone", 410

    # 3. Publish the click event for async analytics processing
    kafka.publish("click_events", {
        "short_code": short_code,
        "timestamp": now(),
        "user_agent": request.headers.get("User-Agent"),
        "ip": request.remote_addr,
        "referrer": request.headers.get("Referer")
    })

    # 4. Increment the real-time click counter
    redis.incr(f"url:clicks:{short_code}")

    # 5. Issue the HTTP redirect (302 so that analytics keep working)
    return redirect(url_data["long_url"], code=302)

API Endpoints:

GET /{short_code}
Returns: 302 Redirect

GET /qr/{short_code}
Returns: QR code image
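
For the QR endpoint, a minimal sketch using the third-party qrcode package (the library choice and the BASE_URL constant are assumptions; the original text does not prescribe either):

import io
import qrcode
from flask import Flask, send_file

app = Flask(__name__)
BASE_URL = "https://sho.rt"  # assumed short domain

@app.route("/qr/<short_code>")
def qr_image(short_code):
    # Encode the full short URL into a QR code image (PNG by default)
    img = qrcode.make(f"{BASE_URL}/{short_code}")

    buf = io.BytesIO()
    img.save(buf)
    buf.seek(0)
    return send_file(buf, mimetype="image/png")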

3. Key Generation Service

Responsibilities:

  • Pre-generate unique keys
  • Distributed key generation
  • Key pool management

Key Generation Strategies:

Strategy 1: Counter-based with Zookeeper

- Zookeeper maintains global counter
- Each server gets range: [1000-2000], [2001-3000]
- Convert to Base62
- No coordination needed within range
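
A minimal sketch of this range-based scheme; claim_next_range() stands in for the ZooKeeper-backed counter and is an assumed API, and the Base62 encoder is the one from Approach 2 above:

RANGE_SIZE = 1000

class RangeKeyGenerator:
    def __init__(self, coordinator):
        self.coordinator = coordinator   # e.g. a ZooKeeper-backed counter service
        self.next_id = None
        self.range_end = None

    def next_key(self):
        # Claim a fresh ID range once the local one is exhausted
        if self.next_id is None or self.next_id > self.range_end:
            start = self.coordinator.claim_next_range(RANGE_SIZE)  # assumed call
            self.next_id, self.range_end = start, start + RANGE_SIZE - 1

        # IDs inside the claimed range need no further coordination
        current_id = self.next_id
        self.next_id += 1
        return generate_short_code_base62(current_id)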

Strategy 2: Twitter Snowflake

64-bit ID:
- 41 bits: timestamp
- 10 bits: machine ID
- 12 bits: sequence

Converted to Base62 for short code
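
The bit layout can be composed with plain shifts; the custom epoch below is an assumption, and sequence is a per-process counter that resets every millisecond:

import time

CUSTOM_EPOCH_MS = 1_577_836_800_000  # assumed epoch: 2020-01-01 UTC

def snowflake_id(machine_id, sequence):
    # 41 bits of milliseconds since the custom epoch,
    # 10 bits of machine ID and 12 bits of sequence number
    timestamp_ms = int(time.time() * 1000) - CUSTOM_EPOCH_MS
    return (timestamp_ms << 22) | ((machine_id & 0x3FF) << 12) | (sequence & 0xFFF)

# The resulting 64-bit integer is Base62-encoded (Approach 2) to obtain the short code.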

Strategy 3: Key Generation Pool (Recommended)

class KeyGenerator:
    def __init__(self):
        self.pool_size = 1000000
        self.refill_threshold = 100000
        self.generate_pool()

    def generate_pool(self):
        # Pre-generate 1M unique keys
        keys = set()
        while len(keys) < self.pool_size:
            key = generate_random_key(length=7)  # e.g. Approach 3 above
            keys.add(key)

        # Store them in a Redis set
        redis.sadd("key_pool", *keys)

    def get_key(self):
        # Atomically pop an unused key from the pool
        key = redis.spop("key_pool")

        # Refill asynchronously when the pool runs low
        pool_size = redis.scard("key_pool")
        if pool_size < self.refill_threshold:
            background_task(self.generate_pool)

        return key

Key Pool in Redis:

Key: key_pool
Type: Set
Value: [abc1234, def5678, ...]
Size: 1M keys

Refill when < 100K

4. Analytics Service

Responsibilities:

  • Click event processing
  • Geographic tracking
  • Time-series aggregation
  • Dashboard data

Database Schema (ClickHouse):

CREATE TABLE clicks (
    click_id UUID,
    short_code VARCHAR(7),
    timestamp DateTime,
    ip_address VARCHAR(45),
    country VARCHAR(2),
    city VARCHAR(100),
    referrer TEXT,
    user_agent TEXT,
    device_type VARCHAR(20),
    browser VARCHAR(50),
    os VARCHAR(50)
) ENGINE = MergeTree()
ORDER BY (short_code, timestamp);

-- Aggregated statistics
CREATE MATERIALIZED VIEW daily_stats
ENGINE = SummingMergeTree()
ORDER BY (short_code, date)
AS SELECT
    short_code,
    toDate(timestamp) as date,
    count() as click_count,
    uniq(ip_address) as unique_visitors
FROM clicks
GROUP BY short_code, date;

Analytics Worker:

def process_click_event(event):
    # Parse the event fields
    short_code = event['short_code']
    ip = event['ip']
    user_agent = event['user_agent']

    # GeoIP lookup (geoip client assumed, as elsewhere in this design)
    location = geoip.lookup(ip)

    # User-agent parsing
    device_info = parse_user_agent(user_agent)

    # Insert the enriched row into ClickHouse
    clickhouse.insert({
        'click_id': uuid4(),
        'short_code': short_code,
        'timestamp': event['timestamp'],
        'ip_address': ip,
        'country': location.country,
        'city': location.city,
        'referrer': event.get('referrer'),
        'user_agent': user_agent,
        'device_type': device_info.device,
        'browser': device_info.browser,
        'os': device_info.os
    })

Analytics Queries:

-- Total clicks
SELECT count() FROM clicks
WHERE short_code = 'abc123';

-- Clicks by country
SELECT country, count() as clicks
FROM clicks
WHERE short_code = 'abc123'
GROUP BY country
ORDER BY clicks DESC;

-- Clicks over time
SELECT
    toDate(timestamp) as date,
    count() as clicks
FROM clicks
WHERE short_code = 'abc123'
GROUP BY date
ORDER BY date;

-- Top referrers
SELECT
    domain(referrer) as referrer_domain,
    count() as clicks
FROM clicks
WHERE short_code = 'abc123' AND referrer != ''
GROUP BY referrer_domain
ORDER BY clicks DESC
LIMIT 10;

API Endpoints:

GET /api/v1/analytics/{short_code}/summary
Returns: { total_clicks, unique_visitors, top_countries }

GET /api/v1/analytics/{short_code}/timeline
Query: start_date, end_date
Returns: { data: [{ date, clicks }] }

GET /api/v1/analytics/{short_code}/referrers
Returns: { data: [{ referrer, clicks }] }

GET /api/v1/analytics/{short_code}/devices
Returns: { desktop, mobile, tablet percentages }

Database Sharding Strategy

Range-based Sharding:

Shard 1: short_codes starting with [0-9, a-k]   (21 of 62 chars)
Shard 2: short_codes starting with [l-z, A-E]   (20 of 62 chars)
Shard 3: short_codes starting with [F-Z]        (21 of 62 chars)

Benefits:
- Roughly even distribution (each range covers about a third of the Base62 alphabet)
- Easy lookup (the first character determines the shard)
- No hotspots, provided codes are random rather than sequential

Hash-based Sharding:

Shard = hash(short_code) % num_shards

Benefits:
- Uniform distribution
- Scalable
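
A minimal routing sketch for the hash-based scheme. It uses zlib.crc32 as a stable hash (Python's built-in hash() is randomized per process, so it cannot be used for shard routing); the shard count is an assumption:

import zlib

NUM_SHARDS = 16  # assumed shard count

def shard_for(short_code):
    # Stable hash of the short code, modulo the number of shards
    return zlib.crc32(short_code.encode()) % NUM_SHARDS

# Usage: pick the connection for a lookup
# conn = shard_connections[shard_for("abc1234")]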

Caching Strategy

Multi-Level Cache:

  1. CDN Cache:

    • Cache redirects at edge
    • Reduce latency
    • TTL: 1 hour
  2. Redis Cache:

    • URL mappings (TTL: 24h)
    • Click counters (persistent)
    • Key pool (persistent)
  3. Application Cache:

    • In-memory LRU cache
    • Hot URLs
    • Size: 10,000 entries (see the sketch below)
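
As a sketch of the application-level cache above, a minimal in-process LRU can be built on OrderedDict; only the 10,000-entry capacity comes from the list, the rest is illustrative:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.entries = OrderedDict()   # short_code -> long_url

    def get(self, short_code):
        if short_code not in self.entries:
            return None
        # Mark the entry as most recently used
        self.entries.move_to_end(short_code)
        return self.entries[short_code]

    def put(self, short_code, long_url):
        self.entries[short_code] = long_url
        self.entries.move_to_end(short_code)
        # Evict the least recently used entry when over capacity
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)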

Cache Invalidation:

  • On URL deletion: invalidate immediately
  • On expiration: lazy deletion
  • On update: write-through

Rate Limiting

Per-User Limits:

Anonymous: 10 URLs/hour
Registered: 100 URLs/hour
Premium: 1000 URLs/hour

Implementation:

def check_rate_limit(user_id):
    key = f"rate_limit:shorten:{user_id}"

    count = redis.incr(key)
    if count == 1:
        # First request in the window: start the 1-hour TTL
        redis.expire(key, 3600)

    if count > get_limit(user_id):
        raise RateLimitExceeded()

    return True

Security Considerations

  • URL validation: Prevent malicious URLs (see the sketch after this list)
  • Spam detection: ML-based filtering
  • Rate limiting: Prevent abuse
  • CAPTCHA: For anonymous users
  • Blacklist: Known malicious domains
  • HTTPS enforcement: Secure redirects
  • XSS prevention: Sanitize inputs
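
A minimal URL-validation sketch covering the scheme check and a domain blacklist; the BLACKLIST contents and the length limit are illustrative assumptions:

from urllib.parse import urlparse

BLACKLIST = {"malware.example", "phish.example"}  # assumed blocklist source
MAX_URL_LENGTH = 2048                             # assumed limit

def validate_long_url(long_url):
    if not long_url or len(long_url) > MAX_URL_LENGTH:
        return False

    parsed = urlparse(long_url)

    # Only allow http/https and require a host
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False

    # Reject known malicious domains
    if parsed.hostname in BLACKLIST:
        return False

    return True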

Failure Handling

Database Failures:

  • Read replicas for failover
  • Write-ahead logging
  • Automated backups

Cache Failures:

  • Fallback to database
  • Graceful degradation
  • Cache warming on recovery

Key Generation Failures:

  • Multiple key pools
  • Fallback to random generation
  • Monitor pool size

Monitoring and Observability

Key Metrics:

  • Shortening latency (p50, p95, p99)
  • Redirection latency
  • Cache hit rate
  • QPS (read/write)
  • Error rates
  • Key pool size

Alerts:

  • Shortening latency > 200ms
  • Redirection latency > 20ms
  • Cache hit rate < 90%
  • Key pool < 10,000
  • Error rate > 1%

Additional Enhancements

Features that could be added to the system:

  • Link Management Dashboard: Bulk operations
  • A/B Testing: Multiple URLs per short code
  • Link Preview: OG tags for social media
  • Password Protection: Secure links
  • Branded Domains: Custom domain support
  • Deep Linking: Mobile app support
  • API Rate Limiting: Fine-grained control
  • Webhook Integration: Click notifications
  • Link Bundles: Group related URLs
  • UTM Parameter Injection: Auto-add tracking params