【Architecture】System Design - URL Shortener

Posted by 西维蜀黍 on 2023-08-20, Last Modified on 2023-09-24

Scope

Functional Requirements

  • 1

  • 2

    • Candidate: How long is the shortened URL?
    • Interviewer: As short as possible.
  • 3

    • Candidate: What characters are allowed in the shortened URL?
    • Interviewer: Shortened URL can be a combination of numbers (0-9) and characters (a-z, A-Z).
  • 4

  • Candidate: Can shortened URLs be deleted or updated?

  • Interviewer: For simplicity, let us assume shortened URLs cannot be deleted or updated.

  • 5

    • The link may be expired?
  • 6

    • Can a customer create a tiny url of his/her choice or will it always be service generated ? If user is allowed to create customer shortened links, what would be the maximum size of custom url ?
    • Yes user can create a tiny url of his/her choice. Assume maximum character limit to be 16.
  • 7

    • Do we expect service to provide metrics like most visited links ?
    • Yes. Service should also aggregate metrics like number of URL redirections per day and other analytics for targeted advertisements.

Non-Functional Requirements:

  • Service should be up and running all the time
  • URL redirection should be fast and should not degrade at any point of time (Even during peak loads)

Volumn

  • 1
    • Candidate: What is the traffic volume?
    • Interviewer: 100 million URLs are generated per day.
  • 2
    • Any SLA required? E.g., the SLA of URL redirecting
  • 3
    • Serve globally? need to deploy services to be close to users?

Here are the basic use cases:

  1. URL shortening: given a long URL => return a much shorter URL
  2. URL redirecting: given a shorter URL => redirect to the original URL
  3. High availability, scalability, and fault tolerance considerations

Estimation

  • Write operation: 100 million URLs are generated per day.
  • Write operation per second: 100 million / 24 /3600 = 1160
  • Read operation: Assuming ratio of read operation to write operation is 10:1, read operation per second: 1160 * 10 = 11,600
  • Assuming the URL shortener service will run for 10 years, this means we must support 100 million * 365 * 10 = 365 billion records.
  • Assume average URL length is 100.
  • Storage requirement over 10 years: 365 billion * 100 bytes * 10 years = 365 TB

High-level Design

Architecture

API Endpoints

URL shortening

To create a new short URL, a client sends a POST request, which contains one parameter: the original long URL. The API looks like this:

POST api/v1/data/shorten

  • request parameter: {longUrl: longURLString}
  • return shortURL

URL redirecting

To redirect a short URL to the corresponding long URL, a client sends a GET request. The API looks like this:

GET api/v1/shortUrl

  • Return longURL for HTTP redirection
  • Once the server receives a tinyurl request, it changes the short URL to the long URL with 301 redirect.

One thing worth discussing here is 301 redirect vs 302 redirect.

  • 301 redirect. A 301 redirect shows that the requested URL is “permanently” moved to the long URL. Since it is permanently redirected, the browser caches the response, and subsequent requests for the same URL will not be sent to the URL shortening service. Instead, requests are redirected to the long URL server directly.
  • 302 redirect. A 302 redirect means that the URL is “temporarily” moved to the long URL, meaning that subsequent requests for the same URL will be sent to the URL shortening service first. Then, they are redirected to the long URL server.

Each redirection method has its pros and cons. If the priority is to reduce the server load, using 301 redirect makes sense as only the first request of the same URL is sent to URL shortening servers. However, if analytics is important, 302 redirect is a better choice as it can track click rate and source of the click more easily.

Deep Dive

Data Model

In the high-level design, everything is stored in a hash table. This is a good starting point; however, this approach is not feasible for real-world systems as memory resources are limited and expensive. A better option is to store <shortURL, longURL> mapping in a relational database.

  1. User ID: A unique user id or API key to make user globally distinguishable
  2. Name: The name of the user
  3. Email: The email id of the user
  4. Creation Date: The date on which the user was registered
  5. Short Url: 6/7 character long unique short URL
  6. Original Url: The original long URL

Shortening Algorithm

Envelope Calculation

A base is a number of digits or characters that can be used to represent a particular number.

Base 10 are digits [0–9], which we use in everyday life and

base 62 are [0–9][a-z][A-Z]

Let’s do a back of the envelope calculation to find out how many characters shall we keep in our tiny url.

URL with length 5, will give 62 = ~916 Million URLs
URL with length 6, will give 62 = ~56 Billion URLs
URL with length 7, will give 62 = ~3500 Billion URLs

Since we required to produce 120 billion URLs, with 7 characters in base62 we will get ~3500 Billion URLs. Hence each of tiny url generated will have 7 characters.

Hash Function

  1. URL encoding through *base62*
  2. URL encoding through *MD5*
  3. *Key Generation Service (KGS)*

Hash value length

The hashValue consists of characters from [0-9, a-z, A-Z], containing 10 + 26 + 26 = 62 possible characters. To figure out the length of hashValue, find the smallest n such that 62^n ≥ 365 billion. The system must support up to 365 billion URLs based on the back of the envelope estimation.

Hash + Collision Resolution

To shorten a long URL, we should implement a hash function that hashes a long URL to a 7-character string. A straightforward solution is to use well-known hash functions like CRC32, MD5, or SHA-1. The following table compares the hash results after applying different hash functions on this URL: https://en.wikipedia.org/wiki/Systems_design.

The first approach is to collect the first 7 characters of a hash value; however, this method can lead to hash collisions. To resolve hash collisions, we can recursively append a new predefined string until no more collision is discovered.

The hash function must satisfy the following requirements:

  • Each longURL must be hashed to one hashValue.
  • Each hashValue can be mapped back to the longURL.

This method can eliminate collision; however, it is expensive to query the database to check if a shortURL exists for every request.

A technique called bloom filters can improve performance. A bloom filter is a space-efficient probabilistic technique to test if an element is a member of a set.

Base 62 Conversion

Base conversion is another approach commonly used for URL shorteners. Base conversion helps to convert the same number between its different number representation systems. Base 62 conversion is used as there are 62 possible characters for hashValue.

Let us use an example to explain how the conversion works: convert 11157 to base 62 representation

($11157_{10}$ represents 11157 in a base 10 system).

  • From its name, base 62 is a way of using 62 characters for encoding. The mappings are: 0-0, …, 9-9, 10-a, 11-b, …, 35-z, 36-A, …, 61-Z, where ‘a’ stands for 10, ‘Z’ stands for 61, etc.
  • $11157_{10} = 2 x 62^2 + 55 x 62^1 + 59 x 62^0 = [2, 55, 59] -> [2, T, X] in base 62$

URL Shortening

URL Redirecting

Analytics

  1. use Grafana for less-accurate analytics
  2. use a separate service to record the analytics for accurate analytics

Cache

Cache penetration

  1. Bloom-filter
  2. Cache empty results

Hot keys

Databases

  • Shard DBs to handle huge object data

Options

  • Cassandra

Wrap Up

  1. Rate limiter

    1. A potential security problem we could face is that malicious users send an overwhelmingly large number of URL shortening requests. Rate limiter helps to filter out requests based on IP address or other filtering rules.
  2. Web server scaling

    1. Since the web tier is stateless, it is easy to scale the web tier by adding or removing web servers.
  3. Database scaling

    1. Database replication and sharding are common techniques.
  4. Analytics

    1. Data is increasingly important for business success. Integrating an analytics solution to the URL shortener could help to answer important questions like how many people click on a link? When do they click the link? etc.

Reference