Gergely Orosz was responsible for operating Uber's payment system. In this article he shares common experiences from that time and offers guidance on how to operate large distributed systems.

Editor's note: This article is from the WeChat public account "InfoQ" (ID: infoqchina). Original author: Gergely Orosz. Planning: Cai Fangfang.

For the past few years, I have been building and operating a large distributed system: Uber's payment system. During this time I learned a lot about distributed architecture concepts and saw first-hand how challenging it is to build and operate a high-load, high-availability system. Building the system itself is interesting work.

You have to plan how the system will handle a 10x or 100x increase in traffic, make sure data stays durable even when hardware fails, and so on; solving these problems is very rewarding. But for me personally, operating a large distributed system has been an even more eye-opening experience.

The larger the system, the more Murphy's Law applies: "anything that can go wrong will go wrong." This is especially true when the system is deployed frequently, many developers are contributing code, data is spread across multiple data centers, and the system is used by huge numbers of users worldwide. Over the past few years I have experienced a variety of system failures, many of them far beyond anything I expected.

They range from the foreseeable, such as hardware failures or defective code accidentally released to production, to the network fiber connecting a data center being cut, or multiple cascading failures happening at once. Many of these failures caused parts of the system to stop working correctly, resulting in outages with large business impact.

This article is a summary of the practices I found useful for reliably operating a large distributed system during my time at Uber. My experience is fairly general; people running systems of similar scale will have gone through much the same things.

I have talked with engineers at companies like Google, Facebook, and Netflix, and their experiences and solutions are similar. Whether the system runs in self-managed data centers (as most of Uber's does) or in the cloud (where Uber's system sometimes scales out to), the ideas and processes in this article apply as long as the system is of comparable size. However, think twice before transferring these practices to smaller or non-critical systems, as many of them will likely be overkill.

Specifically, I will cover the following topics:

  • Monitoring

  • On-call, anomaly detection and alerting

  • Outages and incident management processes

  • Postmortems, incident reviews and a culture of continuous improvement

  • Failover drills, planned downtime, capacity planning and black box testing

  • SLOs, SLAs and reporting on them

  • An independent SRE team

  • Continuous investment in reliability

  • Suggestions for further reading

01 Monitoring

To know whether a system is healthy, we have to answer the question: "Is my system working properly?" To answer it, we first need to collect data about the critical parts of the system. For a distributed system made up of many different services running on many servers across multiple data centers, deciding which key metrics to monitor is not easy.

Infrastructure health monitoring: If one or more servers or virtual machines are overloaded, parts of the distributed system degrade. Services run on these machines, so basic machine health information such as CPU and memory usage is worth monitoring.

Some platforms handle this out of the box and also support auto-scaling. Uber has a large core infrastructure team that provides this low-level monitoring and alerting. However it is implemented, you need to be notified when individual instances or pieces of the infrastructure behind your service are in trouble, and you need that information at hand.
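To make this concrete, here is a minimal sketch of the kind of host-level check an infrastructure monitoring agent performs. It assumes the third-party psutil package, and the thresholds are purely illustrative; real infrastructure monitoring is far more sophisticated than this.

```python
# Minimal host-health poller: a sketch of the signal an infrastructure
# monitoring agent collects. Thresholds are illustrative only.
import time

import psutil  # third-party package for host metrics

CPU_THRESHOLD = 80.0     # percent
MEMORY_THRESHOLD = 90.0  # percent


def check_host_health() -> list[str]:
    """Return human-readable warnings for this host."""
    warnings = []
    cpu = psutil.cpu_percent(interval=1)   # CPU usage sampled over 1 second
    mem = psutil.virtual_memory().percent  # memory usage as a percentage
    if cpu > CPU_THRESHOLD:
        warnings.append(f"high CPU usage: {cpu:.0f}%")
    if mem > MEMORY_THRESHOLD:
        warnings.append(f"high memory usage: {mem:.0f}%")
    return warnings


if __name__ == "__main__":
    while True:
        for warning in check_host_health():
            print(warning)  # in a real system this would feed an alerting pipeline
        time.sleep(60)
```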

Service health monitoring: traffic, errors, latency. "Is the backend service healthy?" is too general a question. Observing endpoint traffic, error rates, and endpoint latency all provide valuable signals about service health.

I like to set up a dedicated dashboard for each of these. When building a new service, mapping responses to the appropriate HTTP status codes and monitoring those codes tells you a lot about the state of the system: for example, returning 4XX codes for client errors and 5XX codes for server failures. This kind of monitoring is easy to build and easy to interpret.
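As a rough illustration (not Uber's actual tooling), a service can simply count responses by status class and derive an error rate from that; these counters are the raw material the traffic and error dashboards are built on:

```python
# Sketch: counting responses by status class (2xx/4xx/5xx).
from collections import Counter

status_counts = Counter()


def record_response(status_code: int) -> None:
    """Call once per completed request, e.g. from middleware."""
    status_class = f"{status_code // 100}xx"  # 200 -> "2xx", 503 -> "5xx"
    status_counts[status_class] += 1


def error_rate() -> float:
    """Fraction of requests that failed on the server side (5xx)."""
    total = sum(status_counts.values())
    return status_counts["5xx"] / total if total else 0.0


# Example: after a small burst of traffic...
for code in (200, 200, 201, 404, 500, 200):
    record_response(code)
print(status_counts, f"error rate: {error_rate():.1%}")
```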

Monitoring latency deserves more thought. The product goal is for most end users to have a good experience. Monitoring only the average latency is not enough, because the average hides the small number of high-latency requests. Empirically, it is better to monitor the P95, P99, or P999 latency, that is, the latency experienced by the slowest 5%, 1%, or 0.1% of requests.

These numbers answer questions such as: how fast is the response for 99% of requests (P99)? If 1,000 people make requests, how slow is the slowest one (P999)? If you are interested in this topic, the article "Latencies Primer" is worth reading.

Figure: Average, P95, and P99 latency over time. Although the average latency is below one second, it hides the fact that 1% of requests take two seconds or more to complete.
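To make the tail effect concrete, here is a small self-contained sketch that computes nearest-rank percentiles over a window of made-up latencies; production systems typically use streaming estimators (histograms, t-digests) rather than sorting raw samples:

```python
# Sketch: why averages hide the tail. Nearest-rank percentiles over a sample.
import math


def percentile(latencies_ms: list[float], p: float) -> float:
    """Return the latency below which p% of requests completed (nearest rank)."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative latencies in milliseconds: mostly fast, with a slow tail.
latencies = [120, 95, 110, 3400, 105, 98, 102, 101, 99, 2600]
print("average:", sum(latencies) / len(latencies))  # 683 ms: hides the tail
print("p50:", percentile(latencies, 50))            # 102 ms
print("p95:", percentile(latencies, 95))            # 3400 ms (tiny sample)
print("p99:", percentile(latencies, 99))            # 3400 ms
```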

Monitoring and observability is a topic that goes much deeper. Two resources worth reading are Google's SRE book (Site Reliability Engineering) and its section on the "four golden signals" of distributed system monitoring.

According to the latter, if you can only measure four metrics of your user-facing system, focus on traffic, errors, latency, and saturation. If you prefer a shorter read, Cindy Sridharan's e-book Distributed Systems Observability is a good choice; it covers other useful tools, such as best practices around event logs, metrics, and tracing.

Business metric monitoring. Monitoring services tells us whether a service appears to behave correctly, but not whether the business is behaving as expected, that is, whether "business is proceeding as usual." Take the payment system as an example: a key question is "Can people pay with a given payment method and successfully complete a trip?" Identifying the business events a service enables and monitoring those events is one of the most important monitoring steps.

Our team built business metric monitoring after being burned by several outages that we had no other way to detect: during those outages, every service appeared healthy, yet core cross-service functionality was broken. This kind of monitoring is specific to our company and business domain, so we had to put a lot of thought into customizing it for ourselves on top of Uber's observability stack.
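As a hedged illustration of the idea rather than our actual pipeline, a business-metric check might count completed payments per payment method and compare them against an expected baseline for the same time window. The method names, counts, and threshold below are all made up:

```python
# Sketch of a business-metric check: are payments with each payment method
# still completing at a plausible rate compared with the usual baseline?
from collections import Counter

completed_payments = Counter()  # payment_method -> count in the current window


def record_payment(payment_method: str) -> None:
    completed_payments[payment_method] += 1


def check_against_baseline(baseline: dict[str, int], min_ratio: float = 0.5) -> list[str]:
    """Flag payment methods whose volume dropped well below the usual level."""
    alerts = []
    for method, expected in baseline.items():
        observed = completed_payments[method]
        if expected and observed / expected < min_ratio:
            alerts.append(f"{method}: {observed} completed vs usual {expected}")
    return alerts


# Usual volume for this hour of the week, e.g. taken from historical data.
baseline = {"credit_card": 1000, "paypal": 300, "cash": 700}
for _ in range(950):
    record_payment("credit_card")
for _ in range(40):     # PayPal volume has collapsed: something is broken
    record_payment("paypal")
for _ in range(680):
    record_payment("cash")
print(check_against_baseline(baseline))
```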

02 On-call, anomaly detection and alerting

Monitoring is a great tool for inspecting the current state of a system. But it is only the first step toward automatically detecting failures, raising alerts, and enabling people to act on them.

On-call is a much broader topic: Increment magazine has done excellent work here, exploring many aspects of being on call. I see on-call as an extension of the "you build it, you own it" mindset: the team that builds a service is also on call for it. Our team is on call for the payment services we build, so whenever an alert fires, the on-call engineer responds and handles the problem. But how do we get from monitoring to alerts?

Detecting anomalies in monitoring data is a big challenge, and a good fit for machine learning. Many third-party services offer anomaly detection. Luckily for our team, we have an in-house machine learning team supporting us, and they have designed a solution tailored to Uber's use cases.

The New York-based Observability team has also written a very good article on how Uber does anomaly detection. From our team's perspective, we push our monitoring data into their pipeline and receive alerts with different confidence levels; we then decide whether a given alert should automatically page an engineer.

When an alert should fire is another question worth exploring. Too few alerts can mean missing a major outage; too many and the on-call engineer never sleeps, and people's energy is limited. Tracking and categorizing alerts and measuring their signal-to-noise ratio is therefore important for tuning the alerting system. Reviewing alerts, marking whether each one required action, and continuously pruning the ones that did not is a good step toward a sustainable on-call rotation.
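For instance, if each page is labeled as actionable or not during the weekly review, the signal-to-noise ratio falls out directly. This is a toy sketch of that bookkeeping, not the tool we actually used:

```python
# Sketch: tracking alert signal-to-noise from on-call labels.
from dataclasses import dataclass


@dataclass
class Page:
    alert_name: str
    actionable: bool  # did the on-call engineer actually have to do anything?


def signal_to_noise(pages: list[Page]) -> float:
    """Fraction of pages that required action; 1.0 means no noise."""
    if not pages:
        return 1.0
    return sum(p.actionable for p in pages) / len(pages)


week = [
    Page("payments_error_rate_high", actionable=True),
    Page("disk_usage_warning", actionable=False),
    Page("disk_usage_warning", actionable=False),
    Page("latency_p99_breach", actionable=True),
]
print(f"{signal_to_noise(week):.0%} of pages were actionable")  # 50%
```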

Figure: Example of an on-call dashboard used internally at Uber, built by the Uber Developer Experience team in Vilnius.

Uber's developer tools team in Vilnius built a dedicated on-call tool for annotating alerts and visualizing shifts. Every week our team reviews the previous on-call shift, analyzes pain points, and spends time improving the on-call experience, week after week.

03 Outages and incident management processes

Imagine you are on call this week. In the middle of the night, you are paged. You investigate whether there is an outage and whether production is affected, and it turns out that part of the system really is down. Thankfully, the monitoring and alerting reflect what is actually happening in the system.

For a small system, an outage is usually manageable: the on-call engineer can tell what happened and why, and can generally diagnose and fix the problem quickly. For a complex system built from many services or microservices, with many engineers pushing code to production, pinpointing the root cause is much harder. In that situation, a few standard processes make things far easier.

Attaching a runbook to the alert, describing simple mitigation steps, is the first line of defense. If the team's runbooks are well written, it is usually not a problem even when the on-call engineer does not understand the system in depth. Runbooks need to be kept up to date and revised as soon as new failure modes appear.
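As an illustration, an alert definition can carry its runbook alongside the firing condition, so the page itself points the on-call engineer at the first steps. The field names and URL below are hypothetical, not the schema of any particular alerting system:

```python
# Sketch of an alert definition that carries its runbook. Illustrative only.
alert_definition = {
    "name": "payments_error_rate_high",
    "condition": "5xx_rate > 1% for 5 minutes",
    "severity": "page",
    "runbook_url": "https://wiki.example.com/runbooks/payments-error-rate",  # hypothetical
    "first_steps": [
        "Check the deploy dashboard for rollouts in the last hour.",
        "If a recent deploy correlates with the spike, roll it back first.",
        "Escalate to the payments on-call channel if the error rate keeps climbing.",
    ],
}
```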

When services are built by multiple teams, communicating outages across teams becomes just as important. At our company, thousands of engineers deploy their services to production whenever they see fit, which adds up to hundreds of deployments every hour. A deployment of one service can affect another service that seems completely unrelated.

Having clear standards for broadcasting outages and clear communication channels is therefore key to resolving them. On several occasions I was paged with alerts that resembled nothing I had seen before; luckily, people on other teams had seen similarly strange problems. In a large incident-handling group we quickly located the root cause and fixed it. Working as a group is always much faster than any single engineer, however experienced.

Restore production first, investigate afterwards. In the heat of handling an outage, I often get an "adrenaline rush" and want to fix the underlying problem on the spot. The outage is typically caused by freshly deployed code with an obvious bug in the change. In the past, my instinct was not to roll back, but to fix the bug directly and ship a new version.

But fixing the bug while production is still broken is a bad idea; the gains are small and the risks are large. Because the fix has to be produced as quickly as possible, it can only really be tested in production, which makes it easy to introduce new defects, and you can end up with a new problem on top of an outage that still has not been resolved.

I have seen a string of outages made worse for exactly this reason. So remember: restore production first, and resist the temptation to fix or investigate the root cause immediately. The thorough investigation can wait until the next working day.


04 Postmortems, incident reviews and a culture of continuous improvement

This is about how a team handles the aftermath of a failure. Do they keep investigating? Do they stop all the product work at hand and make a system-level fix at impressive cost?

Postmortems are a cornerstone of building robust systems. A good postmortem is both detailed and blameless. Uber's postmortem template has evolved through several versions and includes sections such as an incident summary, the approximate impact, a timeline of key events, root cause analysis, lessons learned, and a list of follow-up items.

Figure: An example similar to the postmortem template I used at Uber.

A good postmortem digs deep into the root cause and produces improvements that prevent similar failures, or at least make them quicker to detect and recover from next time. By digging deep I do not mean stopping at "the outage was caused by a defect in a recent commit that the code review missed." Using the "five whys" technique to go deeper leads to far more meaningful conclusions. For example:

  • Why did this happen? -> A defect introduced in a recently submitted piece of code.

  • Why did no one else catch the problem? -> The code reviewer did not notice that the change could cause it.

  • Why did we rely entirely on the code reviewer to catch it? -> Because we have no automated tests for this use case.

  • Why is this use case not covered by automated tests? -> Because it is hard to test without test accounts.

  • Why do we have no test accounts? -> Because the system does not support them yet.

  • Conclusion: the root cause points to a system-level gap, the lack of test accounts. We suggest adding test account support to the system and, as a follow-up, writing automated tests for all future code changes of this kind.

The incident review is an important companion to the postmortem. While many teams are thorough with the postmortem itself, other teams can benefit from the additional input and from being challenged on their preventive improvements. It also matters that teams feel accountable for, and empowered to carry out, the system-level improvements they propose.

At companies that take reliability especially seriously, the most important incident reviews are attended by experienced engineers who weigh in on the key points. Company-level engineering management also needs to push the improvements through, especially those that are time consuming and block other work. Robust systems are not built in a day; they are iterated into existence through continuous improvement, and that iteration comes from a company-wide culture of continuous improvement and learning from incidents.

05 Failover drills, planned downtime, capacity planning and black box testing

Several routine activities require significant investment, but that investment is crucial to keeping a large distributed system running reliably. I first encountered them after joining Uber, because at my previous company neither the scale nor the infrastructure required us to do them.

I used to think data center failover drills were pointless, but my view slowly changed after going through a few of them. I had assumed that designing a robust distributed system was exactly about being able to recover quickly when a data center goes down. If the design is good and works in theory, why test it regularly? The answer has to do with scale: you need to verify that services can still handle the sudden surge of traffic in the data center that takes over.

The most common failure I have observed during a failover is that services in the receiving data center do not have enough capacity to handle the new influx of traffic. Suppose ServiceA and ServiceB each run in two data centers, with dozens to hundreds of virtual machines per data center, utilization at 60%, and alerts firing at 70%.

Now we fail over all traffic from data center A to data center B. Without provisioning new machines, data center B cannot handle the combined load. Provisioning can take a long time, so requests quickly pile up and get dropped. The congestion then starts to affect other services, causing cascading failures in systems that had nothing to do with the failover.
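The arithmetic behind this failure mode is simple. Here is a sketch using the illustrative numbers from above (60% utilization, 70% alert threshold):

```python
# Worked version of the failover scenario: two data centers, each at 60%
# utilization, with all of A's traffic suddenly shifted onto B.
capacity_per_dc = 100      # arbitrary capacity units per data center
utilization = 0.60         # each DC currently serves 60 units of traffic
alert_threshold = 0.70     # alerts fire at 70 units

traffic_per_dc = capacity_per_dc * utilization
traffic_after_failover = traffic_per_dc * 2  # A's traffic now lands on B too

print(f"B must now serve {traffic_after_failover:.0f} units "
      f"with capacity for {capacity_per_dc}.")  # 120 units vs 100: overload
print("Overloaded!" if traffic_after_failover > capacity_per_dc else "OK")
```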

Figure: One possible way a data center failover can fail.

Other possible failure modes include routing issues, network capacity problems, and back-pressure bottlenecks. Any reliable distributed system needs to be able to fail over between data centers without impacting users, which means it should also be possible to drill this regularly. I emphasize "should": such drills are among the most effective ways to verify the reliability of a distributed system's network.

Planned downtime exercises are an excellent way to test the reliability of the overall system and to discover hidden dependencies and unexpected or inappropriate uses of a particular system. For user-facing services with few known dependencies, such an exercise is relatively straightforward.

For core systems that require very high uptime, or that many other systems depend on, it is less intuitive. But what happens if that core system really does go down one day? It is far better to answer that question with a controlled drill, with every team aware that an unexpected breakage is coming and ready for it, than to find out during a real outage.

Black box testing verifies the correctness of the system in the way closest to what an end user sees, much like end-to-end testing. For most products, a proper black box test suite is an investment that pays for itself. Key user flows and the most common user-facing scenarios make good black box test cases: build them so they can be triggered at any time to check whether the system is working.

Taking Uber as an example, an obvious black box test case is checking whether rider-driver matching works within a city: in a specific city, can a request made by a passenger through Uber get a driver to respond and complete a trip? Once this scenario is automated, the test can be run periodically across different cities. With a robust black box test system, it is easy to verify that the system, or at least parts of it, is working. Black box tests are also very useful for failover drills: they are the fastest way to tell whether the failover has succeeded or failed.
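A black box test for this flow might look roughly like the sketch below. The API endpoints, payloads, and test account are hypothetical stand-ins, not Uber's actual API; the point is that the check exercises the system the way a real rider would:

```python
# Sketch of a black-box check for a rider/driver matching flow.
# Endpoints, payloads, and accounts below are hypothetical.
import time

import requests  # third-party HTTP client

BASE_URL = "https://api.example.com"  # hypothetical gateway
TEST_CITY = "vilnius"
TIMEOUT_SECONDS = 120


def ride_gets_matched() -> bool:
    """Request a ride with a test account and wait for a driver match."""
    ride = requests.post(
        f"{BASE_URL}/test/rides",
        json={"city": TEST_CITY, "rider": "blackbox-test-account"},
        timeout=10,
    ).json()
    deadline = time.time() + TIMEOUT_SECONDS
    while time.time() < deadline:
        status = requests.get(f"{BASE_URL}/test/rides/{ride['id']}", timeout=10).json()
        if status["state"] == "driver_matched":
            return True
        time.sleep(5)
    return False


if __name__ == "__main__":
    print("matching OK" if ride_gets_matched() else "matching FAILED")
```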

Figure: Black box test results during a failover drill, with a manual rollback after the failure was confirmed. The x-axis is time; the y-axis is the number of failing test cases.

Capacity planning is almost as important for large distributed systems. By large, I mean systems whose compute and storage costs run into the tens or even hundreds of thousands of dollars per month. At that scale, a self-managed deployment with a fixed number of machines tends to be cheaper than auto-scaling cloud deployments.

At a minimum, a fixed deployment still needs a plan for scaling up when traffic peaks arrive, so that traffic is handled smoothly. And the questions remain: how many instances should you run next month at a minimum? Next quarter? Next year?

For mature systems with well-kept historical data, projecting future traffic is not difficult. Capacity planning also matters for budgeting, picking vendors, and locking in discounts with cloud providers. If your service is expensive to run and you ignore capacity planning, you are missing a very simple way to reduce and control spending.
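A back-of-the-envelope forecast is often enough to get started. The growth rate, per-instance throughput, and headroom factor below are made-up numbers for illustration:

```python
# Sketch of a capacity forecast: project future peak traffic from historical
# growth and translate it into instance counts. All numbers are illustrative.
current_peak_rps = 4_000   # requests per second at today's peak
monthly_growth = 1.06      # ~6% month-over-month growth, from historical data
rps_per_instance = 50      # measured throughput of one instance
headroom = 1.4             # 40% buffer for spikes and failover

for months_ahead in (1, 3, 12):
    projected_rps = current_peak_rps * monthly_growth ** months_ahead
    instances = int(projected_rps * headroom / rps_per_instance) + 1
    print(f"in {months_ahead:>2} months: ~{projected_rps:,.0f} rps "
          f"-> plan for ~{instances} instances")
```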

06 SLOs, SLAs and reporting on them

SLO stands for Service Level Objective: a numerical target for system availability. A good practice is to define SLOs for each individual service, covering capacity, latency, accuracy, availability, and so on. These SLOs can then be used to trigger alerts. Here are some examples of service-level objectives:

Figure: A table of example service-level SLOs covering capacity, latency, accuracy, and availability.

Business-level SLOs, or functional SLOs, are abstractions above individual services. They cover metrics that users or the business directly care about. For example, a business-level SLO might be: 99.99% of emails are expected to be sent within one minute. This SLO may map onto service-level SLOs (such as the latency of the payment system and the email delivery system), or it may need to be measured differently.
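Measuring such a business-level SLO can be as simple as computing what fraction of events met the threshold and comparing it to the target. A toy sketch with invented data:

```python
# Sketch: measuring "99.99% of emails sent within one minute" over a window.
SLO_TARGET = 0.9999
THRESHOLD_SECONDS = 60

send_delays_seconds = [2.1, 4.7, 75.0, 1.3, 8.8, 12.0]  # per-email send latency
within_threshold = sum(d <= THRESHOLD_SECONDS for d in send_delays_seconds)
attainment = within_threshold / len(send_delays_seconds)

print(f"SLO attainment: {attainment:.4%} (target {SLO_TARGET:.2%})")
print("SLO met" if attainment >= SLO_TARGET else "SLO missed")
```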

A Service Level Agreement (SLA) is a broader agreement between a service provider and a service consumer. In general, an SLA is made up of multiple SLOs. For example, "the payment system is 99.99% available" can be an SLA, which is then broken down into specific SLOs for each supporting system.

After defining SLOs, the next step is to measure them and report on them. Automating the measurement and reporting of SLAs and SLOs is an involved project that tends to compete with the priorities of both engineering and business teams. Engineering teams are less interested because they already have monitoring at various levels and can detect outages in real time; business teams are also less interested, preferring to ship features rather than invest in a complex project with no immediate business impact.

This leads to another topic: organizations that operate, or plan to operate, large distributed systems sooner or later invest in dedicated teams for system reliability. Let's talk about the site reliability engineering team.

07 An independent SRE team

The term site reliability engineering probably originated at Google around 2003, and Google now has more than 1,500 SRE engineers. As operating production environments becomes more complex and requires more automation, it soon turns into a full-time job: engineers gradually end up working on production automation full time. The more critical the systems and the more failures they have, the sooner this happens.

Fast-growing technology companies tend to form an SRE team fairly early, each following its own evolutionary path. Uber's SRE team was founded in 2015 with the mission of managing system complexity on an ongoing basis. Other companies fold this work into a dedicated infrastructure team rather than a separate SRE team. Once cross-team reliability work takes up the time of several engineers, it is worth considering a dedicated team for it.

An SRE team makes the operational side of running large distributed systems easier for everyone. It likely owns standard monitoring and alerting tooling, buys or builds its own on-call tools, and advises on on-call best practices. It helps with incident troubleshooting and builds systems that make failures easier to detect, recover from, and prevent. It also leads failover drills, pushes black box testing, and participates in capacity planning. It drives the selection, customization, or construction of standard tools for defining, measuring, and reporting on SLOs.

Different companies have different pain points for an SRE team to address, so SRE teams look different from company to company, and they often go by different names: operations, platform engineering, or infrastructure. Google has released two books on site reliability for free; they are a good starting point if you want to dig deeper into SRE.

08 Continuous investment in reliability

No matter what product you build, the first version is just a beginning.