EP 2: How Large Apps Scale to Handle Millions of Users Efficiently
With load balancing, caching, CDNs & multi-region setups, we can build scalable, high-performance systems for millions of users.
Hello fellow Tech Monks👋! Let’s become the “Go-To” Software Engineer who’s always “Up-To-Date” with the latest tech. Learn a new Software Engineering Concept, every week!
You can also check out: What is Kafka?
Have you ever wondered how big applications like Instagram, Netflix, and LinkedIn handle millions of users seamlessly? Building and scaling a web application that can handle millions of concurrent users requires careful system design to ensure reliability, speed, and efficiency.
In this blog, let's discover the fundamental requirements and steps to scale your web application so that it can handle millions of concurrent users.
Basic Architecture
Let's start with a simple web application setup that everyone understands. This minimal setup consists of:
User Device: The client accessing the application.
Domain Name Server (DNS): Resolves the domain name to an IP address.
Web Server: Processes the request and delivers the content.

When a user tries to access the application, the request first goes to the DNS, which returns the IP address of the web server for the requested domain name. The user device then sends a request to this web server, the server returns the response, and the user can now access the web content. However, this setup is not scalable, because the single web server contains all the necessary components, including the database, backend logic, and caching mechanisms.
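To make that flow concrete, here is a minimal sketch in Python: it resolves a domain name to an IP address (the DNS step) and then sends an HTTP request to the web server. The domain used here is just a placeholder.

```python
import socket
import urllib.request

domain = "example.com"  # placeholder domain for illustration

# Step 1: DNS resolution - the resolver returns the web server's IP address.
ip_address = socket.gethostbyname(domain)
print(f"{domain} resolves to {ip_address}")

# Step 2: the user device sends an HTTP request to that web server
# and receives the response (the web content).
with urllib.request.urlopen(f"http://{domain}/") as response:
    print(response.status, len(response.read()), "bytes received")
```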
Decoupling the Database

As the data volume grows, the need for storage also grows and a single web server cannot handle the increasing storage requirements. To solve this problem, we separate storage into a dedicated database server. Now the server will store all the required data in this database. The type of database (SQL or NoSQL) to use depends on the application’s requirements.
Before we move further, let's have a look at Vertical vs. Horizontal Scaling.
Vertical Scaling: Adding more CPU, RAM, or storage to a single server. However, this has physical limitations because there's a limit to how much CPU and RAM you can add to a single machine.
Horizontal Scaling: Adding more servers to your pool of resources to distribute the load, making it the preferred approach for large-scale applications. Here, even if one server goes down, the others can take up the requests.

Introducing a Load Balancer
So it's evident that we will use horizontal scaling for big applications. But with multiple servers, we need a mechanism to distribute traffic: how do we decide which server will process the request from a specific user? This is where the load balancer comes into the picture! A load balancer receives user requests and directs them to the least busy server (the server with the lowest load), ensuring efficient resource utilization and fault tolerance.

So in this setup, users interact only with the load balancer: the load balancer receives the request from the user and forwards it to a server, the server returns the response to the load balancer, and the load balancer returns it back to the user. An added benefit of users interacting only with the load balancer is that they know the IP address of only the load balancer and not the internal servers, preventing direct access to the backend servers for better security.
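As a rough illustration, here is a minimal sketch of a "least busy server" selection policy. The server names and load counters are made up for this example; real load balancers (such as NGINX or HAProxy) implement this far more robustly.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    active_requests: int = 0  # current load on this server

# Hypothetical pool of backend servers sitting behind the load balancer.
pool = [Server("server-1"), Server("server-2"), Server("server-3")]

def route_request(servers):
    """Pick the server with the fewest active requests (least-busy policy)."""
    target = min(servers, key=lambda s: s.active_requests)
    target.active_requests += 1  # the chosen server takes on the new request
    return target

# Simulate a burst of incoming requests being spread across the pool.
for _ in range(5):
    print("routed to", route_request(pool).name)
```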
Scaling the Database
Now, as we can see, we have only one database instance. What if this database instance crashes due to a power outage or some other failure? The servers won't be able to access the database and will not be able to serve users. A single database instance is a single point of failure. To solve this, we can apply a horizontal scaling approach similar to the one we used for servers: instead of a single database instance, we can have multiple database instances.
But scaling works differently for databases, so let's see how:
Master Database: Handles write requests.
Slave Databases: Handle read requests.
Among all the database instances, one is called the master instance, which processes only write requests, while the others are called slave instances, which process only read requests. Data is synchronized between the master and the slave instances. This process is called database replication.
But there is one problem here: the servers need to decide which slave database to send a request to. To avoid this, we can have a load balancer for the databases as well. This load balancer takes requests from the servers and redirects them to the database instances.

Now, what if the master database instance crashes? The system would no longer be able to process write requests. In such cases, one of the slave instances is automatically promoted to master. There are many algorithms for automatically selecting which slave instance gets promoted to master.
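Here is a minimal sketch of how application code might route queries under replication: writes go to the master, reads are spread across the slaves. The endpoint names and the random slave choice are simplifying assumptions, not a specific database driver's API.

```python
import random

class ReplicatedDatabase:
    """Route writes to the master and distribute reads across slave instances."""

    def __init__(self, master, slaves):
        self.master = master   # single write endpoint
        self.slaves = slaves   # read-only endpoints kept in sync via replication

    def execute_write(self, query):
        # All writes go to the master, which replicates the data to the slaves.
        return f"WRITE on {self.master}: {query}"

    def execute_read(self, query):
        # Reads are distributed across the slaves (here: a random choice).
        return f"READ on {random.choice(self.slaves)}: {query}"

db = ReplicatedDatabase("master-db", ["slave-db-1", "slave-db-2"])
print(db.execute_write("INSERT INTO users ..."))
print(db.execute_read("SELECT * FROM users WHERE id = 42"))
```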
Caching to Improve Performance
We can see that the servers send requests to the database whenever a user requests data. Querying the database for every user request is costly and increases the response time. To reduce the response time, we can set up a cache between the servers and the databases.

A cache stores frequently accessed data, thus reducing response time. It is generally stored as key-value pairs. Whenever the server requires data, it first checks the cache. If the data is available in the cache, it is served from there. If not, the server queries the database, and once the data is fetched, it is brought into the cache.
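This read path is often called the cache-aside pattern. Below is a minimal sketch using a plain dictionary as the cache and a placeholder query_database function; in production, the cache would typically be a system like Redis or Memcached.

```python
cache = {}  # stands in for Redis/Memcached: key -> value

def query_database(user_id):
    # Placeholder for a real (and comparatively slow) database query.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    if key in cache:                  # cache hit: serve directly from the cache
        return cache[key]
    data = query_database(user_id)    # cache miss: fall back to the database
    cache[key] = data                 # store it for future requests
    return data

print(get_user(7))  # miss -> queries the database and fills the cache
print(get_user(7))  # hit  -> served straight from the cache
```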
Different Caching strategies include:
Eviction Policies: Determine which data to remove when the cache is full (e.g., LRU, LFU).
Expiration Policies: Define how long data should be retained.
Content Delivery Network (CDN)
Imagine you're in India, trying to access a website whose servers are located in the United States. Your request must travel all the way to the U.S., and then the response must come back to India. This long travel time increases latency, leading to a poor user experience.
This is where a Content Delivery Network (CDN) comes into play.
What is a CDN?
A CDN is a globally distributed network of servers designed to deliver static content (images, videos, HTML, JavaScript files, etc.) efficiently. These CDN servers are strategically placed in multiple geographical locations to reduce the physical distance between users and the content they request.
How Does a CDN Work?

User Requests Content: When a user visits a website, their request is directed to the nearest CDN server instead of the website’s main server.
Serving from CDN (Cache Hit): If the CDN already has the requested content (such as images, videos, HTML, CSS, JavaScript), it immediately serves the content, reducing response time.
Fetching from the Main Server (Cache Miss): If the CDN doesn’t have the required content, it will fetch it from the main servers, deliver it to the user, and store it for future requests.
Time to Live (TTL): The content stored in the CDN is governed by a Time to Live (TTL) value, typically set through HTTP caching headers such as Cache-Control. It defines how long the content remains valid in the CDN before it needs to be refreshed from the main server.
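For a small, illustrative example, the snippet below fetches a page and prints its Cache-Control header; a value like max-age=86400 would tell a CDN it may keep serving that content for 24 hours before refreshing it from the origin. The URL is a placeholder.

```python
import urllib.request

# Fetch a resource and inspect the caching header set by the origin server.
with urllib.request.urlopen("https://example.com/") as response:
    cache_control = response.headers.get("Cache-Control")
    # e.g. "max-age=86400" -> the CDN may cache this content for 24 hours.
    print("Cache-Control:", cache_control)
```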
How Netflix Uses CDN
CDNs are a game-changer for platforms like Netflix. When a new season of Squid Game is released, millions of users try to stream it simultaneously. Instead of overloading the main servers, Netflix uses a CDN to store episodes on different regional servers.
Now, when a user in India wants to watch Squid Game, they don't need to fetch the episode from the U.S. Instead, they stream it from a local CDN server, ensuring faster loading and smooth playback.
Managing User Sessions
Applications like WhatsApp, LinkedIn, and Instagram rely on sessions to handle user requests efficiently. Every time you interact with these platforms, whether viewing a post, sending a message, or making a request, the system first checks if you're logged in.
But where should session data be stored?
Session data needs to be stored temporarily while it remains active. If we store this data inside a single server, there's a major risk: what if the server fails? If the server storing the session data crashes, all session information is lost, forcing users to log in again.
To avoid this issue, we use shared session storage that is independent of any single server.

Maintaining a shared session storage allows all servers to access session data independently, making the application servers stateless.
This is beneficial because:
Servers don’t need to retain user sessions, any server can handle any request.
If one server fails, requests can be processed by another without losing session data.
Easier to scale, as servers can be added or removed without affecting user sessions.
The session storage can be implemented using: relational databases, caching systems (e.g., Redis, Memcached), or NoSQL databases (often preferred due to their scalability).
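Here is a minimal sketch of stateless session handling against a shared store. The in-memory dictionary stands in for something like Redis, and the token format and TTL are made-up values for illustration.

```python
import secrets
import time

shared_session_store = {}       # stands in for Redis/Memcached, shared by all servers
SESSION_TTL_SECONDS = 30 * 60   # assumed session lifetime: 30 minutes

def create_session(user_id):
    """Called at login; any server can create the session."""
    token = secrets.token_hex(16)
    shared_session_store[token] = {
        "user_id": user_id,
        "expires": time.time() + SESSION_TTL_SECONDS,
    }
    return token

def get_session(token):
    """Called on every request; any server can validate the session."""
    session = shared_session_store.get(token)
    if session is None or session["expires"] < time.time():
        return None   # not logged in (or expired) -> force a fresh login
    return session

token = create_session(user_id=42)
print(get_session(token))  # works no matter which server handles the request
```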
Monitoring System Performance with Message Queues
When building large-scale systems, it's not just about handling user requests efficiently. We also need to monitor system health, log errors, and ensure high availability in case of failures by logging:
Failed requests
Resource usage
Peak traffic hours
To achieve this, we use message queues.

How Message Queues Work:
Producers (main servers) add events (e.g., logs, metrics) to a message queue.
Consumers (workers) process these events asynchronously.
In our system, we can create a set of servers called workers that log information about the system's performance. Whenever the main servers want to log a performance metric, they add an event to a message queue. The workers then process events one by one from the message queue.

This ensures smooth performance monitoring without affecting real-time user experience.
Example: If a server detects high CPU usage, it can log this event to a message queue. A background worker will then analyze it and take the necessary actions, such as scaling up resources.
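A minimal sketch of this producer/worker split, using Python's standard queue module in a single process; in a real deployment, the queue would be an external broker such as Kafka or RabbitMQ.

```python
import queue
import threading

metrics_queue = queue.Queue()  # stands in for Kafka/RabbitMQ

def handle_user_request():
    """Producer: the main server pushes a monitoring event and moves on."""
    metrics_queue.put({"event": "high_cpu", "cpu_percent": 92})

def worker():
    """Consumer: a background worker processes events asynchronously."""
    while True:
        event = metrics_queue.get()
        if event is None:          # sentinel used to stop the worker in this demo
            break
        print("logging event:", event)

t = threading.Thread(target=worker)
t.start()
handle_user_request()      # the user-facing path is not blocked by logging
metrics_queue.put(None)    # shut the worker down for this demo
t.join()
```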
Ensuring System Uptime with Multiple Data Centers
Now, imagine our entire system is deployed only in Japan. What happens if a hurricane causes a complete system outage? Without backup, the whole service would go down!
The solution is to deploy multiple data centers across different regions.
If a user in Japan tries to access the website but the Japan data center is down, their request is automatically routed to the nearest active data center (e.g., in Singapore or the US).
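As a rough sketch, the routing layer can simply pick the nearest data center that is still healthy. The region names and health statuses below are placeholders for what a real geo-DNS service or global load balancer would determine from continuous health checks.

```python
# Hypothetical data centers ordered by distance from the user (nearest first).
data_centers = ["japan", "singapore", "us-west"]

# Illustrative health status; in reality this comes from ongoing health checks.
healthy = {"japan": False, "singapore": True, "us-west": True}

def route_to_data_center(ordered_regions, health):
    """Return the nearest data center that is currently up."""
    for region in ordered_regions:
        if health.get(region):
            return region
    raise RuntimeError("no healthy data center available")

print(route_to_data_center(data_centers, healthy))  # -> "singapore"
```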

This ensures:
Minimal downtime
High availability
Seamless user experience, even during disasters
I hope this gives you a clear picture of how to design a scalable system capable of handling millions of users efficiently, why it is needed, and how it is used.
In a nutshell, by implementing load balancing, caching, CDNs, session management, message queues, and multi-region deployments, we ensure a scalable, high-performance system that handles millions of users efficiently.
Keep learning. You’ve got this!