# Sharding, Replication and System Design Week 51.1 Cohort 3.0


## Stateless vs Stateful Backend

### Stateless backend

Each request is independent. The server does not keep client session state between requests.

Any server instance can handle any request if it has access to shared resources (database, cache).

### Stateful backend

The server keeps client-specific state across requests (sessions, in-memory game state, WebSocket connection state).

Requests from the same client often need the same server instance or a way to retrieve that state.

### When to choose

*   Use stateless when possible (typical web APIs, microservices, REST/HTTP services).
    
*   Use stateful when you need low-latency in-memory state or long-lived bidirectional connections and state replication is impractical.
    

### Examples

*   Stateless: REST API, microservices reading DB, serverless functions.
    
*   Stateful: multiplayer game servers, WebSocket chat servers maintaining connection-specific state, in-memory queued tasks per user.
    

### Common designs to combine strengths

*   Keep core logic stateless; store session or user state in a shared store (Redis, DB).
    
*   Use stateful servers only where needed and front them with a load balancer and sticky sessions, or use a dedicated routing layer.
    

* * *

## Stateless vs Stateful

| Aspect | Stateless | Stateful |
| --- | --- | --- |
| Request handling | Any instance can handle any request | Requests may need a specific instance |
| Horizontal scaling | Easy | Harder (needs sticky routing / replication) |
| Failure recovery | Simple (restart any instance) | Complex (must recover in-memory state) |
| Session storage | External (DB/Redis) | Often in-process memory |
| Use cases | REST APIs, microservices, serverless | Game servers, long-lived sockets, session-heavy apps |
| Best for | High scale, simple deployment | Low-latency, connection-oriented workloads |
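
As a rough sketch of the "Session storage" row: when session data lives in a shared store, any instance can serve any request. Here a `Map` stands in for Redis or a database, and `createInstance` and `handle` are hypothetical names used only for illustration.

```javascript
// A Map stands in for a shared store such as Redis (assumption for this sketch).
const sessionStore = new Map();

// Each "instance" keeps no per-client state of its own; it only reads and
// writes the shared store, so any instance can serve any request.
function createInstance(name) {
  return {
    handle(sessionId, item) {
      const cart = sessionStore.get(sessionId) ?? [];
      const updated = [...cart, item];
      sessionStore.set(sessionId, updated);
      return { servedBy: name, cart: updated };
    },
  };
}

const instanceA = createInstance("A");
const instanceB = createInstance("B");

// The same session hits two different instances and still sees one cart.
instanceA.handle("session-1", "shoes");
const result = instanceB.handle("session-1", "ticket");
console.log(result); // { servedBy: 'B', cart: [ 'shoes', 'ticket' ] }
```

If the cart lived in a variable inside instance A instead, instance B would start from an empty cart, which is the situation that forces sticky routing.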

* * *

## Scaling Node.js With the Cluster Module

### What Cluster does

Node.js Cluster runs multiple Node.js worker processes that share the same server port.

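Each worker is a separate process with its own event loop and memory, which lets a single machine use all of its CPU cores.

The primary process (called "master" in older Node.js versions) accepts incoming connections and distributes them to the workers, round-robin by default on every platform except Windows.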

### Benefits

*   Simple way to utilize all CPU cores on one machine.
    
*   Crashed workers can be restarted: the primary process listens for the `exit` event and forks a replacement.
    

### Limitations & considerations

*   Works only on a single host. To scale across machines you still need a load balancer or service mesh.
    
*   Workers do not share memory — use Redis/DB for shared state.
    
*   Not a replacement for process managers (PM2, Docker orchestration). Use cluster + process manager or container orchestrator.
    
*   Sticky sessions: if your app is stateful (needs the same worker), Node’s default distribution is not sticky. You may need an external sticky LB or a custom master distribution.
    

### Alternatives / complements

*   PM2 cluster mode (manages forks and restarts).
    
*   Worker threads (for CPU-bound tasks within one process).
    
*   Multiple containers / pods behind a load balancer for multi-host scaling (Kubernetes, ECS).
    

**Example**
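
A minimal sketch of the cluster setup described above, assuming Node.js 16 or newer (for `cluster.isPrimary`). The helper name `startCluster`, the port `3000`, and the `--start` flag are choices made for this example, not part of the cluster API.

```javascript
const cluster = require("node:cluster");
const http = require("node:http");
const os = require("node:os");

// Hypothetical helper: fork one worker per CPU core and replace any that exit.
function startCluster(port = 3000, workerCount = os.cpus().length) {
  if (cluster.isPrimary) {
    for (let i = 0; i < workerCount; i++) {
      cluster.fork();
    }
    // The cluster module does not restart workers by itself; the primary
    // has to fork a replacement when one exits.
    cluster.on("exit", (worker, code, signal) => {
      console.log(`worker ${worker.process.pid} exited (${signal || code}), restarting`);
      cluster.fork();
    });
  } else {
    // Every worker listens on the same port; the primary hands out connections.
    http
      .createServer((req, res) => {
        res.end(`handled by worker ${process.pid}\n`);
      })
      .listen(port);
  }
}

// Only start when asked to, e.g. `node server.js --start`.
// Workers inherit the primary's arguments, so they take the same path.
if (process.argv.includes("--start")) {
  startCluster();
}
```

Requesting `http://localhost:3000` several times should show different worker PIDs in the responses, and killing one worker should produce a restart log line from the primary.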

## Scaling Stateless Backend: Load Balancer Patterns

![](https://cdn.hashnode.com/uploads/covers/65def9438b970336d9eb270d/9575ec79-fd5d-4b82-8217-85e554bc9e1b.png align="center")

### Why load balancers

*   Distribute incoming requests across multiple backend instances.
    
*   Improve availability (health checks, failover).
    
*   Centralize cross-cutting concerns (SSL termination, rate limits, authentication gateways).
    

### Types of load balancers (simple)

*   Layer 4 (Transport): TCP/UDP level; efficient, fast. e.g., many cloud LBs in TCP mode.
    
*   Layer 7 (Application): HTTP-aware; can route by URL, headers; supports SSL termination and advanced routing. e.g., NGINX, HAProxy, ALB (AWS).
    

### Key features to use for stateless services

*   **Health checks:** automatically remove unhealthy instances.
    
*   **Auto-scaling integration:** add/remove instances based on metrics.
    
*   **SSL/TLS termination:** offload crypto to LB to simplify backends.
    
*   **HTTP routing and path-based rules:** send traffic to appropriate services.
    
*   **Sticky sessions:** not needed for stateless backends — avoid when possible.
    
*   **Rate limiting + WAF:** protect backends from abuse and attacks.
    
*   **Connection reuse and keepalive:** improve latency and resource usage.
    

### Load balancing algorithms

*   **Round robin:** simple, evenly distributes.
    
*   **Least connections:** prefers less-busy instances (good for uneven request durations).
    
*   **IP hash:** deterministic routing (can be used for “soft” stickiness).
    
*   **Weighted routing:** prefer stronger/faster instances.
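
To make the first two algorithms concrete, here is a small sketch. The backend names and the `activeConnections` field are made up for the example; a real load balancer tracks connection counts itself.

```javascript
// Round robin: cycle through the backends in order.
function createRoundRobin(backends) {
  let next = 0;
  return () => {
    const backend = backends[next];
    next = (next + 1) % backends.length;
    return backend;
  };
}

// Least connections: pick the backend with the fewest in-flight requests.
function pickLeastConnections(backends) {
  return backends.reduce((best, b) =>
    b.activeConnections < best.activeConnections ? b : best
  );
}

const backends = [
  { name: "app-1", activeConnections: 4 },
  { name: "app-2", activeConnections: 1 },
  { name: "app-3", activeConnections: 7 },
];

const nextBackend = createRoundRobin(backends);
const order = [nextBackend(), nextBackend(), nextBackend(), nextBackend()].map((b) => b.name);
console.log(order); // [ 'app-1', 'app-2', 'app-3', 'app-1' ]
console.log(pickLeastConnections(backends).name); // app-2
```

Round robin ignores how busy each backend is, which is why least connections tends to behave better when request durations vary a lot.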
    

### Sticky sessions and statelessness

Avoid sticky sessions for stateless services — they reduce scheduler flexibility and can cause hot spots.

If you must support sessions: store session in a fast shared store (Redis) and keep backends stateless.

## SSL Termination on Load Balancer

## Rate Limiting on Load Balancer Level

## DDoS Protection on Load Balancer Level

## Auto Scaling vs Manual Scaling

## Sticky Backend (Sticky Connection)

## Scaling Stateful Backend Server

![](https://cdn.hashnode.com/uploads/covers/65def9438b970336d9eb270d/e9a2df1f-1260-402e-813f-6251e85f8928.png align="center")

![](https://cdn.hashnode.com/uploads/covers/65def9438b970336d9eb270d/7de53171-27e6-4aea-ac07-d7c6066c2365.png align="center")

![](https://cdn.hashnode.com/uploads/covers/65def9438b970336d9eb270d/8ebdd620-2ed3-492f-bbca-7d5643a58473.png align="center")

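Scaling stateful servers is mostly done manually.

Auto Scaling Groups (AWS) and Managed Instance Groups (GCP) let you do metric-based scaling.

Kubernetes has the Horizontal Pod Autoscaler (HPA), which scales pods. Nodes are scaled separately, for example by the Cluster Autoscaler or Karpenter.

Karpenter adds and removes nodes based on what the pending application pods need.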

![](https://cdn.hashnode.com/uploads/covers/65def9438b970336d9eb270d/ece46f85-b871-438f-bc20-1ec66d6fc9a2.png align="center")

## Stopping bot abuse and bulk buying for high-demand releases like tickets and sneaker drops

Distributed bots use many IPs, like proxies and botnets, making IP limits useless. Attackers can bypass IP or session limits with multiple accounts or stolen credentials. Bots mimic human-like requests to avoid detection.

**CAPTCHA farms** let attackers pay real people to solve challenges, which weakens defenses that rely on challenges alone. Strict limits can hurt real users during traffic spikes, so thresholds are often kept lenient, which attackers exploit. Attackers also spread requests over time and across different endpoints to avoid simple rate limits.

### CAPTCHAs + rate limiting - why combine them

*   **Layered defense:** rate limits slow mass requests; CAPTCHAs raise attacker cost by requiring human (or costly solver) involvement.
    
*   **Adaptive application:** show CAPTCHAs only after suspicious behavior or rate-thresholds, reducing friction for normal users.
    
*   **Reduce solver ROI:** limits constrain how many CAPTCHA solves an attacker can use, making bulk buys expensive or slow.
    
*   **Complementary signals:** CAPTCHA failures, challenge frequency, and rate-limit hits feed a risk score for stronger actions (block, require verification, queue).
    

Combine adaptive rate limiting, targeted CAPTCHAs, behavioral analysis, and policies (per-user quotas, verified accounts, lotteries/queues) for effective protection without wrecking UX.

## How to Add Frontend Page Captcha with Cloudflare

Cloudflare provides an easy way to protect your frontend pages from spam, bots, fake signups, credential stuffing, and automated attacks using **Turnstile Captcha**. Unlike traditional captchas, Cloudflare Turnstile focuses on user privacy and smooth user experience.

### Steps to Add Cloudflare Turnstile Captcha

### 1\. Create a Turnstile Site

1.  Go to Cloudflare Turnstile
    
2.  Log in to your Cloudflare account
    
3.  Create a new Turnstile widget
    
4.  Add your domain
    
5.  Copy:
    
    *   **Site Key**
        
    *   **Secret Key**
        

### 2\. Add Captcha Widget in Frontend

```html
<form id="signup-form" action="/submit" method="POST">
  <input type="email" name="email" placeholder="Enter Email" required />

  <div
    class="cf-turnstile"
    data-sitekey="YOUR_SITE_KEY">
  </div>

  <button type="submit">Submit</button>
</form>

<script src="https://challenges.cloudflare.com/turnstile/v0/api.js" async defer></script>
```

### 3\. Verify Captcha on Backend

Never rely on frontend validation alone. Always verify the captcha response on the backend.

Example using Node.js:

```js
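const express = require("express");
const axios = require("axios");

const app = express();
// Turnstile adds a hidden "cf-turnstile-response" field to the form,
// so parse URL-encoded form posts (and JSON, for fetch-based clients).
app.use(express.urlencoded({ extended: false }));
app.use(express.json());

app.post("/submit", async (req, res) => {
   const token = req.body["cf-turnstile-response"];
   if (!token) {
      return res.status(400).send("Captcha token missing");
   }

   try {
      // Ask Cloudflare to validate the token; the secret key stays on the server.
      const response = await axios.post(
         "https://challenges.cloudflare.com/turnstile/v0/siteverify",
         new URLSearchParams({
            secret: process.env.TURNSTILE_SECRET,
            response: token
         })
      );

      if (response.data.success) {
         return res.send("Human Verified");
      }
      return res.status(400).send("Captcha Failed");
   } catch (err) {
      // Could not reach Cloudflare: fail closed instead of letting the request through.
      return res.status(502).send("Captcha verification unavailable");
   }
});

// Port 3000 is only an example.
app.listen(3000);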
```

* * *

### Features of Cloudflare Turnstile

*   No annoying image puzzles in many cases
    
*   Privacy-focused alternative to reCAPTCHA
    
*   Reduces fake signup attacks
    
*   Works against automated scripts and bots
    
*   Easy frontend integration
    
*   Backend verification support
    
*   Supports invisible verification mode
    

## Frontend Captcha vs Backend Captcha

Both frontend and backend captchas are important, but they serve different purposes.

| Feature | Frontend Captcha | Backend Captcha Verification |
| --- | --- | --- |
| Runs On | Browser/UI | Server |
| Purpose | User interaction validation | Actual security verification |
| Security Level | Medium | High |
| Can Be Bypassed | Yes | Harder |
| Prevents Fake Requests | Partially | Strongly |
| User Experience | Visible to user | Invisible |
| Recommended | Yes | Mandatory |

### Frontend Captcha

Frontend captcha is the challenge, visible or invisible, that runs in the browser before form submission.

Examples:

*   Checkbox verification
    
*   Image selection
    
*   Invisible behavioral detection
    

#### Limitation

Attackers can directly hit your backend API and bypass frontend checks if backend verification is missing.

### Backend Captcha Verification

Backend verification validates the captcha token with Cloudflare servers.

#### Why It Is Important

Without backend verification:

*   Anyone can fake requests
    
*   Bots can bypass frontend UI
    
*   API endpoints remain vulnerable
    

#### Best Practice

Always use:

1.  Frontend captcha widget
    
2.  Backend token verification
    
3.  Rate limiting
    
4.  WAF protection
    

Together they create layered security.

## Invisible Captchas vs Visible Captchas in Cloudflare

Cloudflare Turnstile supports both invisible and visible captcha modes.

| Feature | Invisible Captcha | Visible Captcha |
| --- | --- | --- |
| User Interaction | Minimal | Required sometimes |
| UX Experience | Smooth | Slight interruption |
| Bot Detection | Behavioral analysis | Challenge-based |
| Accessibility | Better | Moderate |
| Security | Good | Strong |
| Best Use Case | Modern apps | High-risk forms |

## Bot Protection with Cloudflare by Proxied Request

One of Cloudflare’s strongest security advantages is its **reverse proxy architecture**.

When Cloudflare proxy is enabled:

```text
User → Cloudflare → Your Server
```

Your real server IP remains hidden behind Cloudflare infrastructure.

### How Cloudflare Bot Protection Works

Cloudflare analyzes requests before they reach your server.

It checks IP reputation, request frequency, browser fingerprint, ASN/network reputation, bot signatures, JavaScript execution, TLS fingerprinting, and conducts behavioral analysis. Suspicious traffic can be blocked, challenged, rate limited, redirected, or logged.

#### Common Cloudflare Security Features

| Feature | Purpose |
| --- | --- |
| WAF | Blocks malicious requests |
| Rate Limiting | Prevents spam/flood attacks |
| Bot Protection | Detects automated traffic |
| DDoS Protection | Prevents server overload |
| SSL/TLS | Encrypts traffic |
| CDN | Improves speed globally |

* * *

### AWS Has a Similar Service Called WAF

Amazon Web Services also provides security services similar to Cloudflare.

The main one is AWS WAF, which helps filter and monitor HTTP requests reaching your applications.

### AWS WAF vs Cloudflare

| Feature | Cloudflare | AWS WAF |
| --- | --- | --- |
| Works As | Reverse Proxy CDN | Cloud Firewall |
| DDoS Protection | Built-in | Via AWS Shield |
| Global CDN | Included | CloudFront needed |
| Bot Protection | Strong built-in | Additional configs |
| Setup Complexity | Easier | More AWS-oriented |
| Best For | General web apps | AWS ecosystem apps |

### When to Use Cloudflare

Use Cloudflare if:

*   You want easy setup
    
*   You need CDN + security together
    
*   You want origin IP protection
    
*   You need strong bot mitigation
    
*   You want fast global delivery
    

### When to Use AWS WAF

Use AWS WAF if:

*   Your infrastructure is fully on AWS
    
*   You already use CloudFront
    
*   You need deep AWS integrations
    
*   You want custom enterprise security rules
    

### How Rate Limiting Can Be Bypassed

Rate limiting controls incoming server requests to allocate resources fairly and prevent abuse.

However, attackers can bypass limits by distributing requests across multiple IPs or using botnets.

They might also exploit flaws in rate limiting logic, like resetting counters or targeting unprotected endpoints. Understanding these vulnerabilities is key to enhancing security.
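
The multiple-IP bypass can be seen with a toy fixed-window limiter keyed by client IP. This is a sketch, not how any particular load balancer implements it; `createRateLimiter` and the IP addresses (from documentation ranges) are assumptions for the example.

```javascript
// Fixed-window limiter keyed by client IP (hypothetical helper for illustration).
function createRateLimiter(limit, windowMs) {
  const windows = new Map(); // ip -> { start, count }
  return function allow(ip, now = Date.now()) {
    const entry = windows.get(ip);
    if (!entry || now - entry.start >= windowMs) {
      // First request from this IP, or the previous window has expired.
      windows.set(ip, { start: now, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= limit;
  };
}

const allow = createRateLimiter(3, 60_000);

// One IP sending 5 requests in the same window: only the first 3 pass.
const singleIp = [0, 1, 2, 3, 4].map((t) => allow("203.0.113.7", t));
console.log(singleIp); // [ true, true, true, false, false ]

// The same 5 requests spread over 5 IPs: every one passes.
const manyIps = [0, 1, 2, 3, 4].map((t) => allow(`198.51.100.${t}`, t));
console.log(manyIps); // [ true, true, true, true, true ]
```

Because the limiter only sees the IP, a botnet looks like many well-behaved clients, which is why limits are usually combined with per-account quotas and other signals.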

## What is a CDN?

A ***Content Delivery Network (CDN)*** is a system of distributed servers that work together to deliver web content quickly to users based on their geographic location.

By caching content closer to the user's location, CDNs reduce latency and improve load times. They also help handle large volumes of traffic and protect against certain types of cyber attacks by distributing the load across multiple servers.

This makes them essential for enhancing the performance and reliability of websites and applications.

### Different Ways to Distribute a React Frontend with a CDN

## Scaling Databases

![](https://cdn.hashnode.com/uploads/covers/65def9438b970336d9eb270d/31d8c384-23f3-45de-8170-9d0526bee2f5.png align="center")

### Serverless Databases

### Managed Database

### Self Hosted Databases

![](https://cdn.hashnode.com/uploads/covers/65def9438b970336d9eb270d/f626b3c6-3480-462e-98fd-3437ce9cf8ad.png align="center")

![](https://cdn.hashnode.com/uploads/covers/65def9438b970336d9eb270d/49b0de19-1f01-45ce-84a4-360a6435ca0b.png align="center")

![](https://cdn.hashnode.com/uploads/covers/65def9438b970336d9eb270d/5151205e-bfc3-4cd3-a69b-f652c022b479.png align="center")

Good questions to think about:

1.  Distributed cache vs in-memory cache
    
2.  Write first or read first
    

### In-Memory Cache vs Redis Cache

## Which Is the Right Operation Ordering? (client → cache → database)

Below are common candidate orders for handling a write when you have a client, a Redis cache, and a database.

*   Option A — Update the cache, then update the database, then return to the client.
    
*   Option B — Update the database, then update the cache, then return to the client.
    
*   Option C — Update the database, then invalidate (clear) the cache, then return to the client.
    
*   Option D — Invalidate (clear) the cache, then update the database, then return to the client.
    

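Answer: Option D. Clearing the cache first means that if the database write then fails, the cache is simply empty and the next read reloads the current value from the database. A read that arrives between the invalidation and the database write can still re-cache the old value, so it helps to give cached entries a short TTL as well.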
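
Option D could be sketched like this. The two `Map` objects stand in for Redis and the database, and `readUser` and `writeUser` are hypothetical helpers; real clients would be asynchronous and could fail between the two steps.

```javascript
// Maps stand in for Redis and the database (assumptions for this sketch).
const cache = new Map();
const db = new Map([["user:1", { name: "Asha" }]]);

// Read path (cache-aside): try the cache, fall back to the database, fill the cache.
function readUser(key) {
  if (cache.has(key)) return cache.get(key);
  const row = db.get(key);
  if (row !== undefined) cache.set(key, row);
  return row;
}

// Write path, Option D: invalidate the cache first, then update the database.
function writeUser(key, value) {
  cache.delete(key);
  db.set(key, value);
  return value;
}

readUser("user:1"); // warms the cache with the old value
writeUser("user:1", { name: "Asha K" });
const cachedAfterWrite = cache.has("user:1");
const freshRead = readUser("user:1"); // cache miss, reloads from the database
console.log(cachedAfterWrite, freshRead); // false { name: 'Asha K' }
```

The write never puts a value into the cache itself; only the read path fills it, so the cache cannot hold a value the database never stored.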

## Indexing in Databases

![](https://cdn.hashnode.com/uploads/covers/65def9438b970336d9eb270d/12f5e567-d4aa-4190-aa03-6e63a5da86a7.png align="center")

Indexing is a database optimization technique used to make data searching faster. Without indexing, the database scans every row one by one, which is called a **Full Table Scan**.

With indexing, the database can directly locate the required data, similar to how we use an index page in a book to quickly find a topic.

In real-world applications containing millions of records, indexing is extremely important because users expect very fast responses.

Most production databases heavily rely on indexes for performance optimization.

### How Indexing Works

An index stores a special data structure (commonly B-Tree) that maintains references to the actual rows in sorted order. When a query searches indexed columns, the database engine uses the index instead of scanning the whole table.

**Example:**

```sql
SELECT * FROM users WHERE email = 'test@gmail.com';
```

If `email` is indexed:

*   The database finds the row directly through the index.
    

If `email` is not indexed:

*   The database checks every row one by one (a full table scan).
    

**Common Columns Used for Indexing**

*   User ID

*   Email
    
*   Username
    
*   Phone Number
    
*   Frequently searched fields
    

| Advantages of Indexing | Disadvantages of Indexing |
| --- | --- |
| Faster search operations | Requires additional storage |
| Improves query performance | Insert/Update/Delete becomes slightly slower |
| Reduces response time | Too many indexes can reduce write performance |
| Useful for filtering and sorting |  |
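
The difference between a full table scan and an index lookup can be imitated in a few lines. This is a simplified stand-in: a real B-Tree is a tree rather than a sorted array, and the `users` rows and function names are invented for the example.

```javascript
// Toy table: an array of rows in insertion order.
const users = [
  { id: 1, email: "zoe@example.com" },
  { id: 2, email: "test@gmail.com" },
  { id: 3, email: "amit@example.com" },
  { id: 4, email: "li@example.com" },
];

// Full table scan: look at rows one by one until a match is found.
function findByScan(email) {
  return users.find((u) => u.email === email);
}

// "Index": email values kept in sorted order, each pointing at its row position.
const emailIndex = users
  .map((u, position) => ({ key: u.email, position }))
  .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));

// Binary search over the sorted index, then one direct row fetch.
function findByIndex(email) {
  let lo = 0;
  let hi = emailIndex.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const { key, position } = emailIndex[mid];
    if (key === email) return users[position];
    if (key < email) lo = mid + 1;
    else hi = mid - 1;
  }
  return undefined;
}

console.log(findByIndex("test@gmail.com")); // { id: 2, email: 'test@gmail.com' }
```

The sketch also shows where the costs in the table come from: `emailIndex` takes extra memory, and every insert or update would have to keep it sorted.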

## Database Replicas

![](https://cdn.hashnode.com/uploads/covers/65def9438b970336d9eb270d/963e9e61-edc3-4722-8d8f-2a91dad4a09d.png align="center")

Database replication means creating copies of a primary database server. These copies are called replicas or read replicas. Replication is mainly used to improve scalability, availability, and performance in large-scale systems.

In most real-world applications, read operations are much higher than write operations. To reduce load on the primary database, read requests are distributed across multiple replicas.

### Primary Database vs Read Replica

| Database Type | Purpose |
| --- | --- |
| Primary DB | Handles write operations |
| Read Replica | Handles read operations |
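
In application code, this split often shows up as a small routing layer. The sketch below uses plain strings in place of real connections, and `createDbRouter`, `forRead`, and `forWrite` are hypothetical names.

```javascript
// Hypothetical router: writes go to the primary, reads rotate across replicas.
function createDbRouter(primary, replicas) {
  let next = 0;
  return {
    forWrite() {
      return primary;
    },
    forRead() {
      if (replicas.length === 0) return primary; // no replicas: fall back to primary
      const replica = replicas[next];
      next = (next + 1) % replicas.length;
      return replica;
    },
  };
}

const router = createDbRouter("primary-db", ["replica-1", "replica-2"]);
const targets = [router.forRead(), router.forRead(), router.forRead(), router.forWrite()];
console.log(targets); // [ 'replica-1', 'replica-2', 'replica-1', 'primary-db' ]
```

Reads that must see the caller's own latest write can be sent through `forWrite()` instead, at the cost of extra load on the primary.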

### 95% Reads and 5% Writes Scenario

In applications like social media platforms, where 95% of traffic involves reading data and only 5% involves writing, multiple read replicas should be created to reduce the load on the primary database.

### Replication Lag

Replication lag occurs because replication is asynchronous in many systems: the primary database applies a write immediately, while replicas receive it after a short delay. Under high traffic, replicas can therefore be temporarily out of date.

### What Should Be the Configuration of Read Replicas?

In a system where most operations are **read operations** (for example 95% reads and 5% writes), the number of **Read Replicas** should be configured according to the read traffic load.

The general idea is:

*   More read traffic → More read replicas
    
*   Fewer read replicas → Higher load on each replica
    

If there are too few replicas for the incoming read load, each replica is overloaded, queries slow down, and the replicas can fall further behind the primary (higher **replication lag**).

### Recommended Configuration

| Traffic Type | Suggested Configuration |
| --- | --- |
| Low read traffic | 1–2 Read Replicas |
| Medium read traffic | 3–5 Read Replicas |
| Very high read traffic | Multiple replicas with load balancer |

### Real-World Example

A user changes their profile photo:

*   The primary DB updates immediately
    
*   A replica may still show the old photo for a few seconds
    

*Figure: system design architecture showing one primary database and multiple read replicas handling read traffic.*

| Advantages | Disadvantages |
| --- | --- |
| Handles large read traffic | Replication lag |
| Improves scalability | Increased infrastructure cost |
| Better fault tolerance | Data consistency challenges |
| Reduces database load | Replicas may temporarily show outdated data |
| Read requests can be distributed across multiple servers | More infrastructure management complexity |
| Improves application performance and response time | Network synchronization overhead between databases |

## Archiving Old Data (Cold Storage)

As applications grow, databases store massive amounts of data such as logs, analytics, transactions, and user activity. Keeping all historical data inside the main database becomes expensive and affects performance.

To solve this problem, old or rarely accessed data is moved to cheaper storage systems called cold storage or archive storage.

### AWS Glacier

A typical strategy involves storing recent data (0–1 year) in the main database and moving older data (1+ years) to Amazon S3 Glacier, which is designed for long-term storage, backup systems, and archived data, keeping the active database small and efficient.
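
The selection step of such an archive job might look like the sketch below. It only decides which rows are older than the cutoff; uploading to S3 Glacier and deleting from the main database are left out. `splitByAge`, the `createdAt` field, and the sample dates are assumptions for the example.

```javascript
// Hypothetical helper: split rows into "keep in the main DB" and "move to archive".
function splitByAge(rows, now, maxAgeDays = 365) {
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  const hot = [];
  const cold = [];
  for (const row of rows) {
    (row.createdAt.getTime() >= cutoff ? hot : cold).push(row);
  }
  return { hot, cold };
}

const now = new Date("2025-06-01T00:00:00Z");
const rows = [
  { id: 1, createdAt: new Date("2025-03-10T00:00:00Z") },
  { id: 2, createdAt: new Date("2023-11-02T00:00:00Z") },
  { id: 3, createdAt: new Date("2024-12-25T00:00:00Z") },
];

const { hot, cold } = splitByAge(rows, now);
console.log(hot.map((r) => r.id));  // [ 1, 3 ]
console.log(cold.map((r) => r.id)); // [ 2 ]
```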

| Advantages | Disadvantages |
| --- | --- |
| Reduces storage cost | Archived data retrieval is slower |
| Improves DB performance | Extra management complexity |
| Easier database scaling | Data recovery may take additional time |
| Better long-term storage management | Requires archive policies and monitoring |
| Keeps active database smaller and faster | Accessing old data is less convenient |
| Helps optimize production database resources | Additional storage architecture setup required |

*Figure: architecture diagram showing recent data in the active database and old data archived to AWS Glacier.*

## Sharding in Databases

Sharding is a database scaling technique where one large database is divided into multiple smaller databases called **shards**. Each shard stores only a portion of the total data.

Sharding is used when a single database server cannot handle massive traffic, storage, or millions of users.

Large-scale companies like Instagram, Netflix, and Facebook use sharding to scale their systems.

### Types of Database Sharding

When a database becomes too large to handle on a single server, engineers divide the data into smaller parts called **shards**. This technique is known as **Sharding**. It helps applications scale horizontally and manage large amounts of traffic and data efficiently.

### 1\. Range-Based Sharding

In range-based sharding, data is divided according to a specific range of values such as IDs, dates, or numbers. This method is simple because records are stored in sequential order.

```text
Users 1-1000    → Shard 1
Users 1001-2000 → Shard 2
Users 2001-3000 → Shard 3
```
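
The mapping above could be looked up in code roughly as follows. The `ranges` table mirrors the example, and `shardForUser` is a hypothetical helper.

```javascript
// Range table matching the example above; each shard owns an inclusive id range.
const ranges = [
  { min: 1, max: 1000, shard: "Shard 1" },
  { min: 1001, max: 2000, shard: "Shard 2" },
  { min: 2001, max: 3000, shard: "Shard 3" },
];

function shardForUser(userId) {
  const match = ranges.find((r) => userId >= r.min && userId <= r.max);
  if (!match) throw new RangeError(`no shard covers user id ${userId}`);
  return match.shard;
}

console.log(shardForUser(1500)); // Shard 2
```

New users always get the highest ids, so with sequential ids the last shard receives all new sign-ups, which is one source of the hotspot problem.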

### Advantages and Disadvantages

| Advantages | Disadvantages |
| --- | --- |
| Simple implementation | Uneven load distribution |
| Easy to understand | One shard may become overloaded |
| Easy debugging | Hotspot problem can occur |
| Works well for ordered data | Scaling becomes difficult later |

### 2\. Hash-Based Sharding

In hash-based sharding, a hash function decides which shard will store the data. This helps distribute records more evenly across multiple servers. This technique is commonly used in high-scale systems to balance traffic.
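
A minimal sketch of hash-based shard selection, assuming string keys and SHA-256 from Node's built-in `crypto` module as the hash function (any stable hash would do). `shardFor` and the `user-N` keys are made up for the example. It also counts how many keys change shard when the shard count grows from 4 to 5.

```javascript
const crypto = require("node:crypto");

// Hash the key, then take the result modulo the shard count.
function shardFor(key, shardCount) {
  const digest = crypto.createHash("sha256").update(String(key)).digest();
  return digest.readUInt32BE(0) % shardCount;
}

const keys = Array.from({ length: 1000 }, (_, i) => `user-${i}`);

// Count keys per shard with 4 shards; the counts should be roughly even.
const perShard = [0, 0, 0, 0];
for (const key of keys) perShard[shardFor(key, 4)] += 1;
console.log(perShard);

// Going from 4 to 5 shards changes the target shard for most keys,
// which is why resharding with plain modulo hashing is expensive.
const moved = keys.filter((k) => shardFor(k, 4) !== shardFor(k, 5)).length;
console.log(`${moved} of ${keys.length} keys would move`);
```

Schemes such as consistent hashing are commonly used to reduce how many keys move when shards are added or removed.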

### Advantages and Disadvantages

| Advantages | Disadvantages |
| --- | --- |
| Better load balancing | Difficult resharding |
| Prevents hotspot issues | Complex shard management |
| More even data distribution | Harder debugging |
| Useful for high-scale systems | Migration becomes difficult |

* * *

### 3\. Geo-Based Sharding

Geo-based sharding divides data according to geographic regions or user locations. Different regions use different database servers.

```text
India Users  → India Server
US Users     → US Server
Europe Users → Europe Server
```

This approach improves speed for global users by keeping data closer to their location.

### Advantages and Disadvantages

| Advantages | Disadvantages |
| --- | --- |
| Faster local response time | Difficult cross-region queries |
| Reduced latency | Complex data synchronization |
| Better user experience | Higher infrastructure complexity |
| Improved regional performance | Cross-region transactions are difficult |

## Problems with Sharding

**1\. Complex Queries**

Joining data across shards becomes difficult and slow.

**2\. Rebalancing Issues**

When one shard becomes overloaded, moving data between shards is complicated.

**3\. Cross-Shard Transactions**

Transactions involving multiple shards are difficult to maintain consistently.

**4\. Hotspot Problem**

Some shards may receive much higher traffic than others.

*Example:*

*   Viral users
    
*   Celebrity accounts
    

**5\. Increased Complexity**

The application architecture becomes harder to manage.

It needs:

*   Routing logic
    
*   Monitoring
    
*   Shard management
