How do servers run non-stop 24/7, 365 days a year?

This blog post explores the principles behind servers operating stably without interruption 24/7, 365 days a year. Discover how hardware and software technologies enable uninterrupted service.

Many internet services are actively operating and developing these days. Representative internet services include social networking services like Facebook and Twitter. These services provide users with platforms for communication and information sharing, establishing themselves as essential tools in modern society. Users experience real-time news sharing and global connectivity through them. Online games and mobile games are also types of internet services. These services have evolved beyond simple entertainment into platforms where users worldwide compete and collaborate in real time. Mobile games, in particular, have gained explosive popularity due to their accessibility anytime, anywhere.
Many users have likely encountered situations where site access is slow, error pages appear, or messages indicate the server is under maintenance. This causes significant inconvenience to users, and if such problems occur at critical moments, it can lead to a loss of user trust. Users often describe these situations as ‘the server is down’. Why does the server suffer and eventually crash? And why does this result in users being unable to access the services they want?
To answer this question, we first need to understand the role of servers, the core of internet services. Servers are the central computer systems that provide services to users. They handle numerous user requests simultaneously, loading webpages and transmitting data. If a server malfunctions, users cannot properly use the service. For this reason, stable and reliable server operation is a critical factor determining the success of internet services.
Non-stop operation technology, as the name implies, is the technology that provides internet services without interruption 24 hours a day, 365 days a year. Users of internet services where non-stop operation technology is well implemented can access the service whenever they want. This maximizes user convenience and is essential for keeping the service provider's revenue stable. To a first approximation, the revenue of an internet service is proportional to the product of the service’s uptime and the number of users connected simultaneously. In other words, a company providing an internet service can boost revenue by increasing either the uptime or the number of concurrent users. The latter depends on marketing and on how the service is strategically designed, while the former is a challenge engineers must solve.
Non-stop operation technology broadly falls into two categories: hardware-based and software-based. In internet services, the program running on the server computer is called the server application. Here, the server computer is the hardware, and the server application is the software. Hardware-based non-stop operation technology refers to techniques applied to the server computer itself to keep it from stopping. Software-based non-stop operation technology refers to techniques applied to the server application itself to keep it from stopping.
How can we create a server computer that never stops? One method is to connect CPUs or hard disks in parallel. To see why this helps, recall that computers can only handle 0s and 1s. Numbers are therefore represented in the binary system. In addition, each character corresponds to a specific number under an encoding called ASCII code. The uppercase letter A is the number 65, and B is the number 66. As a result, all characters and numbers can be represented using 0s and 1s.
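As a small illustration of characters being stored as numbers and bits, the Python sketch below converts text to 8-bit patterns and back. The helper names `to_bits` and `from_bits` are made up for this example. In standard ASCII, the uppercase letter A is 65 and B is 66.

```python
def to_bits(text: str) -> list[str]:
    """Return the 8-bit binary form of each character's ASCII code."""
    return [format(ord(ch), "08b") for ch in text]


def from_bits(bits: list[str]) -> str:
    """Rebuild the text from a list of 8-bit binary strings."""
    return "".join(chr(int(b, 2)) for b in bits)


if __name__ == "__main__":
    print(ord("A"), to_bits("A"))   # 65 ['01000001']
    print(ord("B"), to_bits("B"))   # 66 ['01000010']
    print(from_bits(to_bits("Server")))
```

Flipping a single bit in one of these patterns yields a different number, and therefore a different character, which is why an unintended bit flip can corrupt data.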
Occasionally, computers experience unintended flipping of 0s and 1s. When this happens, the computer may stop working because the intended number or character changes. Components prone to this problem are the CPU and hard disk. Simply put, the CPU is the component that performs arithmetic operations, and the hard disk is the component that stores the results. Since the CPU may reuse data stored on the hard disk, both components must produce correct results to provide normal internet services.
Two are better than one. Connect two CPUs in parallel. For any given operation, perform the calculation on both CPUs and compare the results.
If the results differ, one of the CPUs has made an error, so the operation is retried. Suppose each CPU independently produces an error with a probability of 10%. The results differ in two cases: (true, false) and (false, true). Because a mismatch triggers a retry, these two cases are not a problem. However, if the result is (false, false), the comparison cannot produce a correct result, and the computer will unfortunately halt. Yet the probability of such a halt is only 1%, which is 10% squared. Since the actual error probability of a CPU is far smaller than the 10% assumed here, connecting two, three, or more CPUs in parallel makes it extremely rare for the computer to halt due to a CPU issue.
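The compare-and-retry idea and the 10% arithmetic can be sketched in Python as follows. This is only an illustration under stated assumptions: the two CPUs are modeled as ordinary functions that fail independently, and `make_flaky_adder`, `run_redundant`, and `outcome_probabilities` are hypothetical names, not part of any real system.

```python
from typing import Callable

Adder = Callable[[int, int], int]


def outcome_probabilities(p: float) -> dict[str, float]:
    """Probabilities for one comparison, assuming each CPU errs independently with probability p."""
    return {
        "both_correct": (1 - p) ** 2,        # (true, true)
        "mismatch_retry": 2 * p * (1 - p),   # (true, false) or (false, true)
        "both_wrong": p ** 2,                # (false, false)
    }


def make_flaky_adder(bad_calls: set[int]) -> Adder:
    """Hypothetical CPU: returns a wrong sum on the listed call numbers (1-based)."""
    calls = {"n": 0}

    def add(x: int, y: int) -> int:
        calls["n"] += 1
        return x + y + 1 if calls["n"] in bad_calls else x + y

    return add


def run_redundant(cpu_a: Adder, cpu_b: Adder, x: int, y: int, max_retries: int = 3) -> int:
    """Run the same addition on both CPUs and retry while the results disagree."""
    for _ in range(max_retries):
        result_a, result_b = cpu_a(x, y), cpu_b(x, y)
        if result_a == result_b:
            return result_a
    raise RuntimeError("results never agreed within the retry limit")


if __name__ == "__main__":
    print(outcome_probabilities(0.10))
    # CPU A fails on its first call only, so the second attempt agrees.
    print(run_redundant(make_flaky_adder({1}), make_flaky_adder(set()), 2, 3))
```

With p = 0.10 the three cases come out to roughly 0.81, 0.18, and 0.01, matching the 1% figure. Note that this sketch, like the comparison itself, cannot notice the case where both CPUs return the same wrong value.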
The same applies to hard disks. Data stored on a hard disk can also change from 0 to 1 or from 1 to 0 at some moment. Typically, a hard disk has a built-in function to determine whether data is normal or abnormal. For every run of 10 consecutive bits, it stores the count of 1s. When the computer reads that section, it compares the current count of 1s with the stored count. If they differ, it identifies the data as abnormal. However, standard hard disks have no way to recover the damaged data. Therefore, servers employing non-stop operation technology install multiple hard disks and store the same data on each of them. When the CPU requests stored data and one copy turns out to be abnormal, the server replaces it with a normal copy from another disk before passing the data to the CPU.
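The detection and recovery steps described here can be sketched in Python. This is a simplified model, not how a real drive is implemented: data is a string of "0" and "1" characters, the count of 1s is stored per block of 10 bits as in the text, and the function names are hypothetical. A plain count of 1s misses some errors (for example, one bit flipping to 1 while another in the same block flips to 0), and real drives use stronger error-correcting codes.

```python
BLOCK_SIZE = 10  # bits per checked block, following the description in the text


def ones_counts(bits: str) -> list[int]:
    """Count the 1s in each consecutive block of BLOCK_SIZE bits."""
    return [bits[i:i + BLOCK_SIZE].count("1") for i in range(0, len(bits), BLOCK_SIZE)]


def is_intact(bits: str, stored_counts: list[int]) -> bool:
    """Data is considered normal if its current counts match the stored counts."""
    return ones_counts(bits) == stored_counts


def read_with_repair(replicas: list[str], stored_counts: list[int]) -> str:
    """Return a normal copy of the data and overwrite abnormal copies with it."""
    good = next((r for r in replicas if is_intact(r, stored_counts)), None)
    if good is None:
        raise IOError("no intact copy left on any disk")
    for i, replica in enumerate(replicas):
        if not is_intact(replica, stored_counts):
            replicas[i] = good  # repair the damaged disk from the good copy
    return good


if __name__ == "__main__":
    data = "1100110011" + "0000011111"
    counts = ones_counts(data)            # stored when the data was written
    damaged = "0" + data[1:]              # first bit flipped from 1 to 0
    disks = [damaged, data]               # two disks holding the same data
    print(read_with_repair(disks, counts) == data, disks[0] == data)
```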
How should we build server applications that never stop? There are two main causes of server application stoppage. The first is errors in the program. The second is updating the server application to add new features to the internet service. The remedies follow from these causes: first, use verification programs that can detect errors in the program early; second, run the server application on multiple server computers. Program errors are a persistent problem dating back to the earliest computers and are not unique to internet services. For that reason, specialized verification programs exist to detect such errors early. These verification programs can prevent server application crashes to some extent.
Adding new features to an operational internet service requires stopping the server application and launching a new version with the feature applied. In the simple case where a single server computer runs only one instance of the server application at a time, the old version must be shut down before the new one is started. Users cannot access the service between the shutdown and the moment the new application starts. But what if we run the server application on multiple server computers? This solves the problem, because while one server application is down for the upgrade, another server application can handle the workload. However, it introduces a new challenge: the multiple server computers must communicate and coordinate with one another seamlessly.
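One way to picture upgrading across multiple server computers is the rolling update sketched below in Python. It is a toy model under simple assumptions: each server is just a record with a version and an online flag, and `Server` and `rolling_update` are hypothetical names rather than any real deployment tool.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    version: str
    online: bool = True


def rolling_update(servers: list[Server], new_version: str) -> int:
    """Upgrade servers one at a time.

    Returns the smallest number of servers that were online at any point,
    which shows whether the service stayed reachable during the update.
    """
    if len(servers) < 2:
        raise ValueError("a rolling update needs at least two servers")
    min_online = len(servers)
    for server in servers:
        server.online = False              # stop the old application
        min_online = min(min_online, sum(s.online for s in servers))
        server.version = new_version       # install the new version
        server.online = True               # start the new application
    return min_online


if __name__ == "__main__":
    fleet = [Server("a", "1.0"), Server("b", "1.0"), Server("c", "1.0")]
    print(rolling_update(fleet, "1.1"), [s.version for s in fleet])
```

With three servers, at least two stay online throughout, so requests can still be served while each one restarts.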
One technology used for this is ‘load balancing’. Load balancing distributes incoming tasks evenly across multiple servers so that no single server becomes overloaded. This technology is particularly crucial for large-scale internet services. For example, during events where millions of users access the system simultaneously, a lack of proper load balancing significantly increases the risk of server crashes. Load balancing is therefore an essential element of non-stop operation technology.
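A minimal sketch of one common load balancing strategy, round robin, is shown below in Python. Round robin is chosen here only as an example; the text does not say which strategy a given service uses, and the class and method names are hypothetical. Servers marked as down are skipped, which is how a load balancer can keep a service reachable while one server is being upgraded.

```python
class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that are marked as down."""

    def __init__(self, servers: list[str]) -> None:
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def pick(self) -> str:
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["s1", "s2", "s3"])
    print([lb.pick() for _ in range(6)])
    lb.mark_down("s2")  # e.g. s2 is restarting for an upgrade
    print([lb.pick() for _ in range(4)])
```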
To implement non-stop operation technology, all the techniques mentioned above need to be applied as a foundation. Widely used services like Facebook, Twitter, and Instagram are understood to apply techniques of this kind already. Even so, the result is not perfect. Even when CPUs or hard disks are connected in parallel, the possibility of errors still exists, albeit at a lower rate than before. Furthermore, verification programs cannot detect every problem within a server application. For this reason, these services and many other development companies have additionally developed and applied their own proprietary non-stop technologies. Have you ever seen Facebook’s service go down while using it? While I’ve often experienced issues like images loading slowly or posts taking a while to publish, I’ve never personally encountered Facebook’s servers crashing. This is likely the result of Facebook’s own non-stop operation technology.
Such non-stop operation technology is directly linked to a company’s competitiveness. Uninterrupted service builds user trust, which ultimately plays a crucial role in securing loyal users. Conversely, services that frequently go down can accelerate user churn. Until providers can consistently deliver uninterrupted service and users can consistently rely on it, more research into non-stop operation technology is needed. Through such research, we can build a more stable and reliable internet environment.
